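Deep reinforcement learning (DeepRL) methods have been widely used in robotics to learn about the environment and acquire behaviors autonomously. Deep interactive reinforcement learning (DeepIRL) adds interactive feedback from an external trainer or expert, who gives advice that helps the learner choose actions and so speeds up learning. However, current research has been limited to interactions that offer actionable advice only for the agent's current state. In addition, the agent discards the information after a single use, so the same process is repeated when the same state is revisited. In this paper, we present Broad-Persistent Advising (BPA), an approach that retains and reuses the processed information. It not only helps trainers give more general advice that applies to similar states rather than only the current one, but also allows the agent to speed up learning. We test the proposed approach in two continuous robotic scenarios: a cart-pole balancing task and a simulated robot navigation task. The results show that an agent using BPA performs better than one using the DeepIRL approach while keeping the number of interactions required from the trainer.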
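The idea of retaining advice and reusing it for similar states can be illustrated with a minimal sketch. It is not the authors' implementation: the grid-based notion of "similar state", the class `PersistentAdviceStore`, and the function `choose_action` are all hypothetical names and choices made for illustration.

```python
class PersistentAdviceStore:
    """Keeps trainer advice and reuses it for states in the same grid cell.

    Assumption: "similar" states are those falling into the same coarse bin.
    """

    def __init__(self, low, high, bins_per_dim):
        self.low = list(low)
        self.high = list(high)
        self.bins = bins_per_dim
        self.advice = {}  # grid cell (tuple of ints) -> advised action

    def _key(self, state):
        # Map a continuous state to a coarse grid cell.
        key = []
        for x, lo, hi in zip(state, self.low, self.high):
            cell = int((x - lo) / (hi - lo) * self.bins)
            key.append(min(max(cell, 0), self.bins - 1))
        return tuple(key)

    def store(self, state, action):
        self.advice[self._key(state)] = action

    def recall(self, state):
        return self.advice.get(self._key(state))


def choose_action(state, store, policy_action, ask_trainer=None):
    """Prefer remembered advice, then fresh advice, then the agent's own policy."""
    remembered = store.recall(state)
    if remembered is not None:
        return remembered, "reused"
    if ask_trainer is not None:
        advised = ask_trainer(state)
        if advised is not None:
            store.store(state, advised)  # persist for later visits
            return advised, "advised"
    return policy_action, "policy"
```

With this arrangement the trainer is consulted once per grid cell, and later visits to nearby states reuse the stored action without a new interaction.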
Translated by Google Translate
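Interactive reinforcement learning proposes the use of externally sourced information to speed up the learning process. When interacting with a learner agent, humans may provide either evaluative or informative advice. Prior research has focused on the effect of human-sourced advice by including real-time feedback in the interactive reinforcement learning process, specifically aiming to improve the agent's learning speed while minimizing the time demands on the human. This work focuses on answering which of the two approaches, evaluative or informative, is the instructional approach humans prefer. Moreover, this work presents an experimental setup for a human trial designed to compare the methods people use to deliver advice in terms of human engagement. The results show that users giving informative advice to the learner agent provide more accurate advice, are willing to assist the agent for longer, and provide more advice per episode. In addition, self-evaluation by participants using the informative approach indicates that they rate the agent's ability to follow advice as higher, and therefore perceive their own advice as more accurate, than participants providing evaluative advice.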
In recent years, unmanned aerial vehicle (UAV) related technology has expanded knowledge in the area, bringing to light new problems and challenges that require solutions. Furthermore, because the technology allows processes usually carried out by people to be automated, it is in great demand in industrial sectors. The automation of these vehicles has been addressed in the literature, applying different machine learning strategies. Reinforcement learning (RL) is an automation framework that is frequently used to train autonomous agents. RL is a machine learning paradigm wherein an agent interacts with an environment to solve a given task. However, learning autonomously can be time consuming, computationally expensive, and may not be practical in highly-complex scenarios. Interactive reinforcement learning allows an external trainer to provide advice to an agent while it is learning a task. In this study, we set out to teach an RL agent to control a drone using reward-shaping and policy-shaping techniques simultaneously. Two simulated scenarios were proposed for the training; one without obstacles and one with obstacles. We also studied the influence of each technique. The results show that an agent trained simultaneously with both techniques obtains a lower reward than an agent trained using only a policy-based approach. Nevertheless, the agent achieves lower execution times and less dispersion during training.
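As the autonomous driving industry grows, so does the potential for interaction among groups of autonomous cars. Combined with advances in artificial intelligence and simulation, such groups can be simulated, and safety-critical models that control the cars within them can be learned. This study applies reinforcement learning to the problem of multi-agent car parking, where groups of cars aim to park efficiently while remaining safe and rational. Using robust tools and machine learning frameworks, we design and implement a flexible car parking environment in the form of a Markov decision process with independent learners, exploiting multi-agent communication. We implement a suite of tools to run experiments at scale, obtaining models that park up to 7 cars with a success rate above 98.1%, surpassing existing single-agent models. We also obtain several results on the competitive and collaborative behaviors exhibited by the cars in our environment at varying densities and levels of communication. Notably, we find a form of collaboration without competition, and a "leaky" form of collaboration in which agents collaborate without sufficient state. This work has many potential applications in the autonomous driving and fleet management industries, and provides several useful techniques and benchmarks for applying reinforcement learning to multi-agent car parking.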
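In this work, we present a preliminary investigation of a novel algorithm called Dyna-T. In reinforcement learning (RL), a planning agent has its own representation of the environment as a model. To discover an optimal policy for interacting with the environment, the agent collects experience by trial and error. Experience can be used to learn a better model or to improve the value function and policy directly. These two uses are typically kept separate; Dyna-Q is a hybrid approach that, at each iteration, uses real experience to update the model as well as the value function, while planning its actions using simulated data from its model. However, the planning process is computationally expensive and depends strongly on the dimensionality of the state-action space. We propose to build an Upper Confidence Tree (UCT) on the simulated experience and to search for the best action to select during the online learning process. We demonstrate the effectiveness of our proposed method in a set of preliminary tests on three testbed environments from OpenAI. In contrast to Dyna-Q, Dyna-T outperforms state-of-the-art RL agents in stochastic environments by choosing a more robust action selection strategy.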
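For readers unfamiliar with the Dyna-Q baseline described above, the following is a minimal tabular sketch of the standard algorithm (direct RL update, model learning, then planning from simulated transitions). It does not implement Dyna-T or its UCT search. The deterministic model, the toy `chain_step` environment, and all parameter defaults are assumptions made for illustration.

```python
import random
from collections import defaultdict


def dyna_q(step, start_state, actions, episodes=50, planning_steps=10,
           alpha=0.1, gamma=0.95, epsilon=0.1, max_steps=200, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)  # (state, action) -> value estimate
    model = {}              # (state, action) -> (reward, next_state, done)

    def greedy(state):
        best = max(q[(state, a)] for a in actions)
        return rng.choice([a for a in actions if q[(state, a)] == best])

    def update(s, a, r, s2, done):
        # One-step Q-learning backup.
        target = r if done else r + gamma * max(q[(s2, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(episodes):
        s = start_state
        for _ in range(max_steps):
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            r, s2, done = step(s, a)
            update(s, a, r, s2, done)       # learn from real experience
            model[(s, a)] = (r, s2, done)   # remember the observed transition
            for _ in range(planning_steps):  # replay simulated experience
                (ps, pa), (pr, ps2, pdone) = rng.choice(list(model.items()))
                update(ps, pa, pr, ps2, pdone)
            if done:
                break
            s = s2
    return q


def chain_step(state, action):
    """Toy 5-state chain (hypothetical): reaching state 4 ends the episode."""
    nxt = min(max(state + action, 0), 4)
    done = nxt == 4
    return (1.0 if done else 0.0), nxt, done


q_values = dyna_q(chain_step, start_state=0, actions=[-1, 1])
```

The planning loop is where the cost noted in the abstract arises: each real step triggers `planning_steps` extra backups, sampled uniformly from everything the model has seen.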
The reinforcement learning paradigm is a popular way to address problems that have only limited environmental feedback, rather than correctly labeled examples, as is common in other machine learning contexts. While significant progress has been made to improve learning in a single task, the idea of transfer learning has only recently been applied to reinforcement learning tasks. The core idea of transfer is that experience gained in learning to perform one task can help improve learning performance in a related, but different, task. In this article we present a framework that classifies transfer learning methods in terms of their capabilities and goals, and then use it to survey the existing literature, as well as to suggest future directions for transfer learning work.
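Nowadays, cooperative multi-agent systems are used to learn how to achieve goals in large-scale dynamic environments. However, learning in these environments is challenging, with problems ranging from the effect of search-space size on learning time to inefficient cooperation among agents. Moreover, reinforcement learning algorithms may suffer from long convergence times in such environments. This paper introduces a communication framework. In the proposed framework, agents learn to cooperate effectively, and a new state calculation method considerably reduces the size of the state space. In addition, a knowledge-transfer algorithm is presented to share gained experience among agents, together with an effective knowledge-fusion mechanism that fuses an agent's own experience with the knowledge received from other team members. Finally, simulation results are provided to indicate the efficacy of the proposed method in a complex learning task. We evaluated our approach on the herding problem; the results show that the knowledge-transfer mechanism accelerates learning, and that generating similar states based on state abstraction reduces the size of the state space.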
Safe Reinforcement Learning can be defined as the process of learning policies that maximize the expectation of the return in problems in which it is important to ensure reasonable system performance and/or respect safety constraints during the learning and/or deployment processes. We categorize and analyze two approaches of Safe Reinforcement Learning. The first is based on the modification of the optimality criterion, the classic discounted finite/infinite horizon, with a safety factor. The second is based on the modification of the exploration process through the incorporation of external knowledge or the guidance of a risk metric. We use the proposed classification to survey the existing literature, as well as suggesting future directions for Safe Reinforcement Learning.
With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework now capable of learning complex policies in high-dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computational challenges in real-world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning, and inverse reinforcement learning, which are related but are not classical RL algorithms. The role of simulators in training agents, and methods to validate, test, and robustify existing RL solutions, are also discussed.
Imitation learning techniques aim to mimic human behavior in a given task. An agent (a learning machine) is trained to perform a task from demonstrations by learning a mapping between observations and actions. The idea of teaching by imitation has been around for many years; however, the field has recently been gaining attention due to advances in computing and sensing as well as rising demand for intelligent applications. The paradigm of learning by imitation is gaining popularity because it facilitates teaching complex tasks with minimal expert knowledge of the tasks. Generic imitation learning methods could potentially reduce the problem of teaching a task to that of providing demonstrations, without the need for explicit programming or designing reward functions specific to the task. Modern sensors are able to collect and transmit high volumes of data rapidly, and processors with high computational power allow fast processing that maps the sensory data to actions in a timely manner. This opens the door for many potential AI applications that require real-time perception and reaction, such as humanoid robots, self-driving vehicles, human-computer interaction, and computer games, to name a few. However, specialized algorithms are needed to learn models effectively and robustly, as learning by imitation poses its own set of challenges. In this paper, we survey imitation learning methods and present design options in different steps of the learning process. We introduce a background and motivation for the field and highlight challenges specific to the imitation problem. Methods for designing and evaluating imitation learning tasks are categorized and reviewed. Special attention is given to learning methods in robotics and games, as these domains are the most popular in the literature and provide a wide array of problems and methodologies.
We extensively discuss combining imitation learning approaches using different sources and methods, as well as incorporating other motion learning methods to enhance imitation. We also discuss the potential impact on industry, present major applications and highlight current and future research directions.
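In the last decade, there have been significant advances in multi-agent reinforcement learning (MARL), but numerous challenges remain, such as high sample complexity and slow convergence to stable policies, that need to be overcome before widespread deployment is possible. In practice, however, many real-world environments already deploy sub-optimal or heuristic approaches for generating policies. An interesting question is how best to use such approaches as advisors to help improve reinforcement learning in multi-agent domains. In this paper, we provide a principled framework for incorporating action recommendations from online sub-optimal advisors in multi-agent settings. We describe the problem of ADvising Multiple Intelligent Reinforcement Agents (ADMIRAL) in non-restrictive general-sum stochastic game environments and present two novel Q-learning-based algorithms: ADMIRAL - Decision Making (ADMIRAL-DM) and ADMIRAL - Advisor Evaluation (ADMIRAL-AE), which allow us to improve learning by appropriately incorporating advice from an advisor (ADMIRAL-DM) and to evaluate the effectiveness of an advisor (ADMIRAL-AE). We analyze the algorithms theoretically and provide fixed-point guarantees regarding their learning in general-sum stochastic games. Furthermore, extensive experiments illustrate that these algorithms can be used in a variety of environments, compare favorably to other related baselines, can scale to large state-action spaces, and are robust to poor advice from advisors.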
Efficient use of the space in an elevator is essential for a service robot, because it reduces the time spent waiting for the next elevator. To address this, we propose a hybrid approach that combines reinforcement learning (RL) with voice interaction for robot navigation when entering an elevator. Compared to traditional navigation methods such as Optimal Reciprocal Collision Avoidance (ORCA), RL gives robots a high exploration ability to find a new clear path into the elevator. The proposed method allows the robot to actively clear a path towards the elevator when a crowd of people stands at the entrance while plenty of space remains inside. This is done by embedding a clear-path action (a voice prompt) into the RL framework, and the proposed navigation policy helps the robot finish tasks efficiently and safely. Our approach greatly improves the success rate and reward of entering the elevator compared to state-of-the-art navigation policies without an active clear-path operation.
Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.
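State-of-the-art multi-agent reinforcement learning (MARL) methods have provided promising solutions to a variety of complex problems. Yet these methods all assume that agents execute synchronized primitive actions, so they do not genuinely scale to long-horizon real-world multi-agent/robot tasks, which inherently require agents/robots to reason asynchronously about high-level action selection over varying time durations. The Macro-Action Decentralized Partially Observable Markov Decision Process (MacDec-POMDP) is a general formalization of asynchronous decision-making under uncertainty in fully cooperative multi-agent tasks. In this thesis, we first propose a group of value-based RL approaches for MacDec-POMDPs, in which agents can perform asynchronous learning and decision-making with macro-action-value functions in three paradigms: decentralized learning and control, centralized learning and control, and centralized training for decentralized execution (CTDE). Building on this work, we formulate a set of macro-action-based policy gradient algorithms under the three training paradigms, in which agents can directly optimize their parameterized policies in an asynchronous manner. We evaluate our methods both in simulation and on real robots. Empirical results demonstrate the advantages of our approaches in large multi-agent problems and validate the effectiveness of our algorithms for learning high-quality, asynchronous solutions with macro-actions.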
Adequately assigning credit to actions for future outcomes based on their contributions is a long-standing open challenge in Reinforcement Learning. The assumptions of the most commonly used credit assignment method are disadvantageous in tasks where the effects of decisions are not immediately evident. Furthermore, this method can only evaluate actions that have been selected by the agent, making it highly inefficient. Still, no alternative methods have been widely adopted in the field. Hindsight Credit Assignment is a promising, but still unexplored candidate, which aims to solve the problems of both long-term and counterfactual credit assignment. In this thesis, we empirically investigate Hindsight Credit Assignment to identify its main benefits, and key points to improve. Then, we apply it to factored state representations, and in particular to state representations based on the causal structure of the environment. In this setting, we propose a variant of Hindsight Credit Assignment that effectively exploits a given causal structure. We show that our modification greatly decreases the workload of Hindsight Credit Assignment, making it more efficient and enabling it to outperform the baseline credit assignment method on various tasks. This opens the way to other methods based on given or learned causal structures.
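Reinforcement learning (RL) and, more recently, deep reinforcement learning are popular methods for solving sequential decision-making problems modeled as Markov decision processes (MDPs). Modeling a problem for RL and selecting algorithms and hyperparameters require careful consideration, because different configurations may lead to completely different performance. These considerations are mainly the task of RL experts; however, RL is progressively becoming popular in other fields where researchers and system designers are not RL experts. Besides, many modeling decisions, such as defining the state and action spaces, the batch size, the frequency of batch updates, and the number of timesteps, are typically made manually. For these reasons, automating the different components of the RL framework is of great importance, and it has attracted much attention in recent years. Automated RL provides a framework in which the different components of RL, including MDP modeling, algorithm selection, and hyperparameter optimization, are modeled and defined automatically. In this article, we explore the literature and present recent work that can be used in automated RL. Moreover, we discuss the challenges, open questions, and research directions in AutoRL.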
Besides the recent impressive results on reinforcement learning (RL), safety is still one of the major research challenges in RL. RL is a machine-learning approach to determine near-optimal policies in Markov decision processes (MDPs). In this paper, we consider the setting where the safety-relevant fragment of the MDP together with a temporal logic safety specification is given and many safety violations can be avoided by planning ahead a short time into the future. We propose an approach for online safety shielding of RL agents. During runtime, the shield analyses the safety of each available action. For any action, the shield computes the maximal probability to not violate the safety specification within the next $k$ steps when executing this action. Based on this probability and a given threshold, the shield decides whether to block an action from the agent. Existing offline shielding approaches compute exhaustively the safety of all state-action combinations ahead of time, resulting in huge computation times and large memory consumption. The intuition behind online shielding is to compute at runtime the set of all states that could be reached in the near future. For each of these states, the safety of all available actions is analysed and used for shielding as soon as one of the considered states is reached. Our approach is well suited for high-level planning problems where the time between decisions can be used for safety computations and it is sustainable for the agent to wait until these computations are finished. For our evaluation, we selected a 2-player version of the classical computer game SNAKE. The game represents a high-level planning problem that requires fast decisions and the multiplayer setting induces a large state space, which is computationally expensive to analyse exhaustively.
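The per-action safety check described above can be sketched for a small, explicitly enumerated MDP. This is an illustrative simplification, not the paper's implementation: it computes k-step safety by plain recursion rather than by the online scheme of precomputing for reachable states, and the transition table format, the absolute threshold rule, and the names `make_shield`, `action_safety`, and `allowed_actions` are assumptions.

```python
from functools import lru_cache


def make_shield(transitions, unsafe, k, threshold):
    """transitions[state][action] is a list of (probability, next_state) pairs."""

    @lru_cache(maxsize=None)
    def best_safety(state, steps):
        # Maximal probability of avoiding unsafe states for `steps` more steps.
        if state in unsafe:
            return 0.0
        if steps == 0 or not transitions.get(state):
            return 1.0
        return max(action_safety(state, a, steps) for a in transitions[state])

    def action_safety(state, action, steps):
        # Take `action` now, then act as safely as possible afterwards.
        return sum(p * best_safety(nxt, steps - 1)
                   for p, nxt in transitions[state][action])

    def allowed_actions(state):
        # Block every action whose k-step safety falls below the threshold.
        return [a for a in transitions[state]
                if action_safety(state, a, k) >= threshold]

    return action_safety, allowed_actions


# Hypothetical two-state example: one action is always safe, one may crash.
toy_mdp = {
    "start": {
        "safe_path": [(1.0, "start")],
        "risky_path": [(0.7, "start"), (0.3, "crash")],
    },
    "crash": {},
}
action_safety, allowed_actions = make_shield(toy_mdp, unsafe={"crash"}, k=2, threshold=0.9)
```

Here `allowed_actions("start")` keeps only `safe_path`, because `risky_path` stays safe for two steps with probability 0.7, which is below the chosen threshold.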
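Shared autonomy refers to approaches that enable an autonomous agent to collaborate with a human in order to improve human performance. Beyond improving performance, however, it is often also beneficial for the agent to account for preserving the user's experience or satisfaction with the collaboration. To address this additional goal, we examine approaches for improving the user experience by constraining the number of interventions by the autonomous agent. We propose two model-free reinforcement learning methods that can account for both hard and soft constraints on the number of interventions. We show that our method not only outperforms the existing baseline but also eliminates the need to manually tune a black-box hyperparameter for controlling the level of assistance. We also provide an in-depth analysis of intervention scenarios to further illuminate system understanding.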
Safety is still one of the major research challenges in reinforcement learning (RL). In this paper, we address the problem of how to avoid safety violations of RL agents during exploration in probabilistic and partially unknown environments. Our approach combines automata learning for Markov Decision Processes (MDPs) and shield synthesis in an iterative approach. Initially, the MDP representing the environment is unknown. The agent starts exploring the environment and collects traces. From the collected traces, we passively learn MDPs that abstractly represent the safety-relevant aspects of the environment. Given a learned MDP and a safety specification, we construct a shield. For each state-action pair within a learned MDP, the shield computes exact probabilities on how likely it is that executing the action results in violating the specification from the current state within the next $k$ steps. After the shield is constructed, the shield is used during runtime and blocks any actions that induce a too large risk from the agent. The shielded agent continues to explore the environment and collects new data on the environment. Iteratively, we use the collected data to learn new MDPs with higher accuracy, resulting in turn in shields able to prevent more safety violations. We implemented our approach and present a detailed case study of a Q-learning agent exploring slippery Gridworlds. In our experiments, we show that as the agent explores more and more of the environment during training, the improved learned models lead to shields that are able to prevent many safety violations.
This work is exploratory research concerned with determining how reinforcement learning can be used to predict optimal PID parameters for a robot designed for apple harvesting. To study this, an algorithm called Advantage Actor Critic (A2C) is implemented on a simulated robot arm. The simulation primarily relies on the ROS framework. Experiments for tuning one actuator at a time and two actuators at a time are run, both of which show that the model is able to predict PID gains that perform better than the set baseline. In addition, we study whether the model is able to predict PID parameters based on where an apple is located. Initial tests show that the model is indeed able to adapt its predictions to apple locations, making it an adaptive controller.
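To make concrete what "predicting PID gains that perform better than a baseline" could mean, the sketch below scores a set of gains by the tracking error they produce. It does not reproduce the ROS simulation or the A2C agent: the first-order plant, the cost definition, and the names `PID` and `tracking_cost` are assumptions chosen for illustration. An RL agent could, for instance, use the negative of such a cost as its reward.

```python
class PID:
    """Discrete-time PID controller with gains kp, ki, kd and time step dt."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def update(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt
        # No derivative term on the first call, since there is no previous error.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def tracking_cost(kp, ki, kd, setpoint=1.0, dt=0.01, steps=500):
    """Integrated absolute error on a toy plant dx/dt = -x + u (hypothetical)."""
    pid = PID(kp, ki, kd, dt)
    x, cost = 0.0, 0.0
    for _ in range(steps):
        u = pid.update(setpoint, x)
        x += (-x + u) * dt  # forward-Euler step of the plant
        cost += abs(setpoint - x) * dt
    return cost


baseline_cost = tracking_cost(0.5, 0.0, 0.0)  # weak proportional-only baseline
tuned_cost = tracking_cost(5.0, 2.0, 0.0)     # higher gain plus integral action
```

On this toy plant the second set of gains yields a lower cost, because the integral term removes the steady-state error that a proportional-only controller leaves behind.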