强化学习(RL)为解决各种复杂的决策任务提供了新的机会。但是,现代的RL算法,例如,深Q学习是基于深层神经网络,在Edge设备上运行时的计算成本很高。在本文中,我们提出了QHD,一种高度增强的学习,它模仿了大脑特性,以实现健壮和实时学习。 QHD依靠轻巧的大脑启发模型来学习未知环境中的最佳政策。我们首先建立一个新颖的数学基础和编码模块,该模块将状态行动空间映射到高维空间中。因此,我们开发了一个高维回归模型,以近似Q值函数。 QHD驱动的代理通过比较每个可能动作的Q值来做出决定。我们评估了不同的RL培训批量和本地记忆能力对QHD学习质量的影响。我们的QHD也能够以微小的本地记忆能力在线学习,这与培训批量大小一样小。 QHD通过进一步降低记忆容量和批处理大小来提供实时学习。这使得QHD适用于在边缘环境中高效的增强学习,这对于支持在线和实时学习至关重要。我们的解决方案还支持少量的重播批量大小,与DQN相比,该批量的速度为12.3倍,同时确保质量损失最小。我们的评估显示了实时学习的QHD能力,比最先进的Deep RL算法提供了34.6倍的速度和更高的学习质量。
translated by 谷歌翻译
In this paper, we build on advances introduced by the Deep Q-Networks (DQN) approach to extend the multi-objective tabular Reinforcement Learning (RL) algorithm W-learning to large state spaces. W-learning algorithm can naturally solve the competition between multiple single policies in multi-objective environments. However, the tabular version does not scale well to environments with large state spaces. To address this issue, we replace underlying Q-tables with DQN, and propose an addition of W-Networks, as a replacement for tabular weights (W) representations. We evaluate the resulting Deep W-Networks (DWN) approach in two widely-accepted multi-objective RL benchmarks: deep sea treasure and multi-objective mountain car. We show that DWN solves the competition between multiple policies while outperforming the baseline in the form of a DQN solution. Additionally, we demonstrate that the proposed algorithm can find the Pareto front in both tested environments.
translated by 谷歌翻译
本文介绍了用于交易单一资产的双重Q网络算法,即E-MINI S&P 500连续期货合约。我们使用经过验证的设置作为我们环境的基础,并具有多个扩展。我们的贸易代理商的功能不断扩展,包括其他资产,例如商品,从而产生了四种型号。我们还应对环境条件,包括成本和危机。我们的贸易代理商首先接受了特定时间段的培训,并根据新数据进行了测试,并将其与长期策略(市场)进行了比较。我们分析了各种模型与样本中/样本外性能之间有关环境的差异。实验结果表明,贸易代理人遵循适当的行为。它可以将其政策调整为不同的情况,例如在存在交易成本时更广泛地使用中性位置。此外,净资产价值超过了基准的净值,代理商在测试集中的市场优于市场。我们使用DDQN算法对代理商在金融领域中的行为提供初步见解。这项研究的结果可用于进一步发展。
translated by 谷歌翻译
Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policybased methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.
translated by 谷歌翻译
Quantum Computing在古典计算机上解决困难的计算任务的显着改进承诺。然而,为实际使用设计量子电路不是琐碎的目标,并且需要专家级知识。为了帮助这一努力,提出了一种基于机器学习的方法来构建量子电路架构。以前的作品已经证明,经典的深度加强学习(DRL)算法可以成功构建量子电路架构而没有编码的物理知识。但是,这些基于DRL的作品不完全在更换设备噪声中的设置,从而需要大量的培训资源来保持RL模型最新。考虑到这一点,我们持续学习,以提高算法的性能。在本文中,我们介绍了深度Q-Learning(PPR-DQL)框架的概率策略重用来解决这个电路设计挑战。通过通过各种噪声模式进行数值模拟,我们证明了具有PPR的RL代理能够找到量子栅极序列,以比从划痕训练的代理更快地生成双量标铃声状态。所提出的框架是一般的,可以应用于其他量子栅极合成或控制问题 - 包括量子器件的自动校准。
translated by 谷歌翻译
新一代网络威胁的兴起要求更复杂和智能的网络防御解决方案,配备了能够学习在没有人力专家知识的情况下做出决策的自治代理。近年来提出了用于自动网络入侵任务的几种强化学习方法(例如,马尔可夫)。在本文中,我们介绍了一种新一代的网络入侵检测方法,将基于Q学习的增强学习与用于网络入侵检测的深馈前神经网络方法相结合。我们提出的深度Q-Learning(DQL)模型为网络环境提供了正在进行的自动学习能力,该网络环境可以使用自动试验误差方法检测不同类型的网络入侵,并连续增强其检测能力。我们提供涉及DQL模型的微调不同的超参数的细节,以获得更有效的自学。根据我们基于NSL-KDD数据集的广泛实验结果,我们确认折扣因子在250次训练中设定为0.001,产生了最佳的性能结果。我们的实验结果还表明,我们所提出的DQL在检测不同的入侵课程和优于其他类似的机器学习方法方面的高度有效。
translated by 谷歌翻译
Development of navigation algorithms is essential for the successful deployment of robots in rapidly changing hazardous environments for which prior knowledge of configuration is often limited or unavailable. Use of traditional path-planning algorithms, which are based on localization and require detailed obstacle maps with goal locations, is not possible. In this regard, vision-based algorithms hold great promise, as visual information can be readily acquired by a robot's onboard sensors and provides a much richer source of information from which deep neural networks can extract complex patterns. Deep reinforcement learning has been used to achieve vision-based robot navigation. However, the efficacy of these algorithms in environments with dynamic obstacles and high variation in the configuration space has not been thoroughly investigated. In this paper, we employ a deep Dyna-Q learning algorithm for room evacuation and obstacle avoidance in partially observable environments based on low-resolution raw image data from an onboard camera. We explore the performance of a robotic agent in environments containing no obstacles, convex obstacles, and concave obstacles, both static and dynamic. Obstacles and the exit are initialized in random positions at the start of each episode of reinforcement learning. Overall, we show that our algorithm and training approach can generalize learning for collision-free evacuation of environments with complex obstacle configurations. It is evident that the agent can navigate to a goal location while avoiding multiple static and dynamic obstacles, and can escape from a concave obstacle while searching for and navigating to the exit.
translated by 谷歌翻译
Hybrid FSO/RF system requires an efficient FSO and RF link switching mechanism to improve the system capacity by realizing the complementary benefits of both the links. The dynamics of network conditions, such as fog, dust, and sand storms compound the link switching problem and control complexity. To address this problem, we initiate the study of deep reinforcement learning (DRL) for link switching of hybrid FSO/RF systems. Specifically, in this work, we focus on actor-critic called Actor/Critic-FSO/RF and Deep-Q network (DQN) called DQN-FSO/RF for FSO/RF link switching under atmospheric turbulences. To formulate the problem, we define the state, action, and reward function of a hybrid FSO/RF system. DQN-FSO/RF frequently updates the deployed policy that interacts with the environment in a hybrid FSO/RF system, resulting in high switching costs. To overcome this, we lift this problem to ensemble consensus-based representation learning for deep reinforcement called DQNEnsemble-FSO/RF. The proposed novel DQNEnsemble-FSO/RF DRL approach uses consensus learned features representations based on an ensemble of asynchronous threads to update the deployed policy. Experimental results corroborate that the proposed DQNEnsemble-FSO/RF's consensus-learned features switching achieves better performance than Actor/Critic-FSO/RF, DQN-FSO/RF, and MyOpic for FSO/RF link switching while keeping the switching cost significantly low.
translated by 谷歌翻译
With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework now capable of learning complex policies in high dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computational challenges in real world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning, inverse reinforcement learning that are related but are not classical RL algorithms. The role of simulators in training agents, methods to validate, test and robustify existing solutions in RL are discussed.
translated by 谷歌翻译
强化学习和最近的深度增强学习是解决如Markov决策过程建模的顺序决策问题的流行方法。问题和选择算法和超参数的RL建模需要仔细考虑,因为不同的配置可能需要完全不同的性能。这些考虑因素主要是RL专家的任务;然而,RL在研究人员和系统设计师不是RL专家的其他领域中逐渐变得流行。此外,许多建模决策,例如定义状态和动作空间,批次的大小和批量更新的频率以及时间戳的数量通常是手动进行的。由于这些原因,RL框架的自动化不同组成部分具有重要意义,近年来它引起了很多关注。自动RL提供了一个框架,其中RL的不同组件包括MDP建模,算法选择和超参数优化是自动建模和定义的。在本文中,我们探讨了可以在自动化RL中使用的文献和目前的工作。此外,我们讨论了Autorl中的挑战,打开问题和研究方向。
translated by 谷歌翻译
The high emission and low energy efficiency caused by internal combustion engines (ICE) have become unacceptable under environmental regulations and the energy crisis. As a promising alternative solution, multi-power source electric vehicles (MPS-EVs) introduce different clean energy systems to improve powertrain efficiency. The energy management strategy (EMS) is a critical technology for MPS-EVs to maximize efficiency, fuel economy, and range. Reinforcement learning (RL) has become an effective methodology for the development of EMS. RL has received continuous attention and research, but there is still a lack of systematic analysis of the design elements of RL-based EMS. To this end, this paper presents an in-depth analysis of the current research on RL-based EMS (RL-EMS) and summarizes the design elements of RL-based EMS. This paper first summarizes the previous applications of RL in EMS from five aspects: algorithm, perception scheme, decision scheme, reward function, and innovative training method. The contribution of advanced algorithms to the training effect is shown, the perception and control schemes in the literature are analyzed in detail, different reward function settings are classified, and innovative training methods with their roles are elaborated. Finally, by comparing the development routes of RL and RL-EMS, this paper identifies the gap between advanced RL solutions and existing RL-EMS. Finally, this paper suggests potential development directions for implementing advanced artificial intelligence (AI) solutions in EMS.
translated by 谷歌翻译
通过回顾一封来自情节记忆的过去的经验,可以通过回忆过去的经验来实现钢筋学习的样本效率。我们提出了一种新的基于模型的轨迹的集体记忆,解决了集体控制的当前限制。我们的记忆估计轨迹值,指导代理人朝着良好的政策。基于内存构建,我们通过动态混合控制统一模型的基于动态和习惯学习来构建互补学习模型,进入单个架构。实验表明,我们的模型可以比各种环境中的其他强力加强学习代理更快,更好地学习,包括随机和非马尔可夫环境。
translated by 谷歌翻译
A central problem in computational biophysics is protein structure prediction, i.e., finding the optimal folding of a given amino acid sequence. This problem has been studied in a classical abstract model, the HP model, where the protein is modeled as a sequence of H (hydrophobic) and P (polar) amino acids on a lattice. The objective is to find conformations maximizing H-H contacts. It is known that even in this reduced setting, the problem is intractable (NP-hard). In this work, we apply deep reinforcement learning (DRL) to the two-dimensional HP model. We can obtain the conformations of best known energies for benchmark HP sequences with lengths from 20 to 50. Our DRL is based on a deep Q-network (DQN). We find that a DQN based on long short-term memory (LSTM) architecture greatly enhances the RL learning ability and significantly improves the search process. DRL can sample the state space efficiently, without the need of manual heuristics. Experimentally we show that it can find multiple distinct best-known solutions per trial. This study demonstrates the effectiveness of deep reinforcement learning in the HP model for protein folding.
translated by 谷歌翻译
由于数据量增加,金融业的快速变化已经彻底改变了数据处理和数据分析的技术,并带来了新的理论和计算挑战。与古典随机控制理论和解决财务决策问题的其他分析方法相比,解决模型假设的财务决策问题,强化学习(RL)的新发展能够充分利用具有更少模型假设的大量财务数据并改善复杂的金融环境中的决策。该调查纸目的旨在审查最近的资金途径的发展和使用RL方法。我们介绍了马尔可夫决策过程,这是许多常用的RL方法的设置。然后引入各种算法,重点介绍不需要任何模型假设的基于价值和基于策略的方法。连接是用神经网络进行的,以扩展框架以包含深的RL算法。我们的调查通过讨论了这些RL算法在金融中各种决策问题中的应用,包括最佳执行,投资组合优化,期权定价和对冲,市场制作,智能订单路由和Robo-Awaring。
translated by 谷歌翻译
在这项工作中,我们提出并评估了一种新的增强学习方法,紧凑体验重放(编者),它使用基于相似转换集的复发的预测目标值的时间差异学习,以及基于两个转换的经验重放的新方法记忆。我们的目标是减少在长期累计累计奖励的经纪人培训所需的经验。它与强化学习的相关性与少量观察结果有关,即它需要实现类似于文献中的相关方法获得的结果,这通常需要数百万视频框架来培训ATARI 2600游戏。我们举报了在八个挑战街机学习环境(ALE)挑战游戏中,为仅10万帧的培训试验和大约25,000次迭代的培训试验中报告了培训试验。我们还在与基线的同一游戏中具有相同的实验协议的DQN代理呈现结果。为了验证从较少数量的观察结果近似于良好的政策,我们还将其结果与从啤酒的基准上呈现的数百万帧中获得的结果进行比较。
translated by 谷歌翻译
基于模型的强化学习有望通过学习环境中的中间模型来预测未来的相互作用,从而从与环境的互动较少的相互作用中学习最佳政策。当预测一系列相互作用时,限制预测范围的推出长度是关键的超参数,因为预测的准确性会降低远离真实体验的区域。结果,从长远来看,从长远来看,总体上更糟糕的政策。因此,超参数提供了质量和效率之间的权衡。在这项工作中,我们将调整推出长度调整为元级的顺序决策问题的问题构成了问题,该问题优化了基于模型的强化学习所学到的最终策略,鉴于环境相互作用的固定预算通过基于反馈动态调整超参数来调整超参数。从学习过程中,例如模型的准确性和互动的其余预算。我们使用无模型的深度强化学习来解决元级决策问题,并证明我们的方法在两个众所周知的强化学习环境上优于共同的启发式基准。
translated by 谷歌翻译
We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
translated by 谷歌翻译
Drug dosing is an important application of AI, which can be formulated as a Reinforcement Learning (RL) problem. In this paper, we identify two major challenges of using RL for drug dosing: delayed and prolonged effects of administering medications, which break the Markov assumption of the RL framework. We focus on prolongedness and define PAE-POMDP (Prolonged Action Effect-Partially Observable Markov Decision Process), a subclass of POMDPs in which the Markov assumption does not hold specifically due to prolonged effects of actions. Motivated by the pharmacology literature, we propose a simple and effective approach to converting drug dosing PAE-POMDPs into MDPs, enabling the use of the existing RL algorithms to solve such problems. We validate the proposed approach on a toy task, and a challenging glucose control task, for which we devise a clinically-inspired reward function. Our results demonstrate that: (1) the proposed method to restore the Markov assumption leads to significant improvements over a vanilla baseline; (2) the approach is competitive with recurrent policies which may inherently capture the prolonged effect of actions; (3) it is remarkably more time and memory efficient than the recurrent baseline and hence more suitable for real-time dosing control systems; and (4) it exhibits favorable qualitative behavior in our policy analysis.
translated by 谷歌翻译
具有成本效益的资产管理是多个行业的兴趣领域。具体而言,本文开发了深入的加固学习(DRL)解决方案,以自动确定不断恶化的水管的最佳康复政策。我们在在线和离线DRL设置中处理康复计划的问题。在在线DRL中,代理与具有不同长度,材料和故障率特征的多个管道的模拟环境进行交互。我们使用深Q学习(DQN)训练代理商,以最低限度的平均成本和减少故障概率学习最佳政策。在离线学习中,代理使用静态数据,例如DQN重播数据,通过保守的Q学习算法学习最佳策略,而无需与环境进行进一步的交互。我们证明,基于DRL的政策改善了标准预防,纠正和贪婪的计划替代方案。此外,从固定的DQN重播数据集中学习超过在线DQN设置。结果保证,由大型国家和行动轨迹组成的水管的现有恶化概况为在离线环境中学习康复政策提供了宝贵的途径,而无需模拟器。
translated by 谷歌翻译