Smart Paper Notes

Inference and dynamic decision-making for deteriorating systems with probabilistic dependencies through Bayesian networks and deep reinforcement learning

Pablo G. Morato , Charalampos P. Andriotis , Konstantinos G. Papakonstantinou , Philippe Rigo

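Categories: Artificial Intelligence | Machine Learning | (Statistical) Machine Learning

2022-09-02

In the context of modern environmental and societal concerns, there is an increasing demand for methods able to identify management strategies for civil engineering systems that minimize structural failure risks while optimally planning inspection and maintenance (I&M) processes. Because of the computational complexity associated with global optimization methodologies under joint system-level state descriptions, most available methods simplify the I&M decision problem to the component level. In this paper, we propose an efficient algorithmic framework for inference and decision-making for engineering systems exposed to deteriorating environments, providing optimal management strategies directly at the system level. In our approach, the decision problem is formulated as a partially observable Markov decision process whose dynamics are encoded in Bayesian network conditional structures. The methodology can handle environments under equal or general, unequal deterioration correlations among components, through Gaussian hierarchical structures and dynamic Bayesian networks. In terms of policy optimization, we adopt a deep decentralized multi-agent actor-critic (DDMAC) reinforcement learning approach, in which the policies are approximated by actor neural networks guided by a critic network. By including deterioration dependence in the simulated environment, and by formulating the cost model at the system level, DDMAC policies intrinsically consider the underlying system effects. This is demonstrated through numerical experiments conducted for both a 9-out-of-10 system and a steel frame under fatigue deterioration. The results show that DDMAC policies offer substantial benefits compared to state-of-the-art heuristic approaches. The inherent consideration of system effects by DDMAC strategies is also interpreted based on the learned policies.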

Optimal Inspection and Maintenance Planning for Deteriorating Structural Components through Dynamic Bayesian Networks and Markov Decision Processes

P. G. Morato , K. G. Papakonstantinou , C. P. Andriotis , J. S. Nielsen , P. Rigo

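Categories: Artificial Intelligence

2020-09-09

Civil and maritime engineering systems, from bridges to offshore platforms and wind turbines, must be managed efficiently because they are exposed to deterioration mechanisms, such as fatigue or corrosion, throughout their operational life. Identifying optimal inspection and maintenance policies demands the solution of a complex sequential decision-making problem under uncertainty, with the main objective of efficiently controlling the risk associated with structural failures. To address this complexity, risk-based inspection planning methodologies, often supported by dynamic Bayesian networks, evaluate a set of predefined heuristic decision rules to reasonably simplify the decision problem. However, the resulting policies may be compromised by the limited space considered in the definition of the decision rules. Avoiding this limitation, partially observable Markov decision processes (POMDPs) provide a principled mathematical methodology for stochastic optimal control under uncertain action outcomes and observations, in which the optimal actions are prescribed as a function of the entire, dynamically updated state probability distribution. In this paper, we combine dynamic Bayesian networks with POMDPs in a joint framework for optimal inspection and maintenance planning, and we provide the formulation for developing both infinite- and finite-horizon POMDPs in a structural reliability context. The proposed methodology is implemented and tested for the case of a structural component subject to fatigue deterioration, demonstrating the capability of state-of-the-art point-based POMDP solvers to solve the underlying planning optimization problem. In the numerical experiments, POMDP and heuristic-based policies are thoroughly compared, and the results show that POMDPs achieve substantially lower costs than their counterparts, even for traditional problem settings.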
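
The dynamically updated state probability distribution mentioned in this abstract is commonly called the belief. As a minimal sketch of how such a belief could be updated after one time step and one inspection outcome, consider the Python example below. It is not the paper's model: the two-state component, the transition and observation probabilities, and the name `belief_update` are all hypothetical and chosen only for illustration.

```python
def belief_update(belief, transition, observation, obs):
    """Return the posterior belief after one time step and one observation.

    belief[s]          : prior probability of state s
    transition[s][s2]  : P(s2 | s) under the chosen action
    observation[s2][o] : P(o | s2) under the chosen action
    """
    n = len(belief)
    # Prediction step: propagate the belief through the deterioration model.
    predicted = [sum(belief[s] * transition[s][s2] for s in range(n))
                 for s2 in range(n)]
    # Correction step: weight each state by the likelihood of the observation.
    unnormalized = [observation[s2][obs] * predicted[s2] for s2 in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [u / total for u in unnormalized]


# Hypothetical two-state component: state 0 = intact, state 1 = damaged.
TRANSITION = [[0.9, 0.1],   # an intact component becomes damaged with probability 0.1
              [0.0, 1.0]]   # damage persists when no repair is made
# Hypothetical inspection: observation 0 = no detection, 1 = detection.
OBSERVATION = [[0.8, 0.2],
               [0.3, 0.7]]

# Start certain that the component is intact, then observe a detection.
posterior = belief_update([1.0, 0.0], TRANSITION, OBSERVATION, obs=1)
print(posterior)
```

With these assumed numbers the posterior is approximately [0.72, 0.28]: a single detection raises the damage probability from the predicted 0.10 to about 0.28. A POMDP policy maps such a belief vector, not a single assumed state, to the next inspection or maintenance action.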

Bridging POMDPs and Bayesian decision making for robust maintenance planning under model uncertainty: An application to railway systems

Giacomo Arcieri , Cyprien Hoelzl , Oliver Schwery , Daniel Straub , Konstantinos G. Papakonstantinou , Eleni Chatzi

Categories: Artificial Intelligence | Machine Learning

2022-12-15

Structural Health Monitoring (SHM) describes a process for inferring quantifiable metrics of structural condition, which can serve as input to support decisions on the operation and maintenance of infrastructure assets. Given the long lifespan of critical structures, this problem can be cast as a sequential decision making problem over prescribed horizons. Partially Observable Markov Decision Processes (POMDPs) offer a formal framework to solve the underlying optimal planning task. However, two issues can undermine the POMDP solutions. Firstly, the need for a model that can adequately describe the evolution of the structural condition under deterioration or corrective actions and, secondly, the non-trivial task of recovery of the observation process parameters from available monitoring data. Despite these potential challenges, the adopted POMDP models do not typically account for uncertainty on model parameters, leading to solutions which can be unrealistically confident. In this work, we address both key issues. We present a framework to estimate POMDP transition and observation model parameters directly from available data, via Markov Chain Monte Carlo (MCMC) sampling of a Hidden Markov Model (HMM) conditioned on actions. The MCMC inference estimates distributions of the involved model parameters. We then form and solve the POMDP problem by exploiting the inferred distributions, to derive solutions that are robust to model uncertainty. We successfully apply our approach on maintenance planning for railway track assets on the basis of a "fractal value" indicator, which is computed from actual railway monitoring data.

Towards Causal Credit Assignment

Mátyás Schubert

Categories: Machine Learning | Artificial Intelligence

2022-12-22

Adequately assigning credit to actions for future outcomes based on their contributions is a long-standing open challenge in Reinforcement Learning. The assumptions of the most commonly used credit assignment method are disadvantageous in tasks where the effects of decisions are not immediately evident. Furthermore, this method can only evaluate actions that have been selected by the agent, making it highly inefficient. Still, no alternative methods have been widely adopted in the field. Hindsight Credit Assignment is a promising, but still unexplored candidate, which aims to solve the problems of both long-term and counterfactual credit assignment. In this thesis, we empirically investigate Hindsight Credit Assignment to identify its main benefits, and key points to improve. Then, we apply it to factored state representations, and in particular to state representations based on the causal structure of the environment. In this setting, we propose a variant of Hindsight Credit Assignment that effectively exploits a given causal structure. We show that our modification greatly decreases the workload of Hindsight Credit Assignment, making it more efficient and enabling it to outperform the baseline credit assignment method on various tasks. This opens the way to other methods based on given or learned causal structures.

Recent Advances in Reinforcement Learning in Finance

Ben Hambly , Renyuan Xu , Huining Yang

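Categories: Machine Learning

2021-12-08

The rapid changes in the finance industry due to the increasing amount of data have revolutionized the techniques for data processing and data analysis and have brought new theoretical and computational challenges. In contrast to classical stochastic control theory and other analytical approaches for solving financial decision-making problems, which rely heavily on model assumptions, new developments in reinforcement learning (RL) are able to make full use of large amounts of financial data with fewer model assumptions and to improve decisions in complex financial environments. This survey paper aims to review recent developments in, and uses of, RL approaches in finance. We give an introduction to Markov decision processes, which are the setting for many of the commonly used RL approaches. Various algorithms are then introduced, with a focus on value-based and policy-based methods that do not require any model assumptions. Connections are made with neural networks to extend the framework to encompass deep RL algorithms. Our survey concludes by discussing the application of these RL algorithms to a variety of decision-making problems in finance, including optimal execution, portfolio optimization, option pricing and hedging, market making, smart order routing, and robo-advising.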

Partially Observable Markov Decision Processes in Robotics: A Survey

Mikko Lauri , David Hsu , Joni Pajarinen

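Categories: Robotics | Artificial Intelligence

2022-09-21

Noisy sensing, imperfect control, and environment changes are defining characteristics of many real-world robot tasks. The partially observable Markov decision process (POMDP) provides a principled mathematical framework for modeling and solving robot decision and control tasks under uncertainty. Over the last decade, it has seen many successful applications, spanning localization and navigation, search and tracking, autonomous driving, multi-robot systems, manipulation, and human-robot interaction. This survey aims to bridge the gap between the development of POMDP models and algorithms at one end and applications to diverse robot decision tasks at the other. It analyzes the characteristics of these tasks and connects them with the mathematical and algorithmic properties of the POMDP framework for effective modeling and solution. For practitioners, the survey provides some of the key task characteristics for deciding when and how to apply POMDPs to robot tasks successfully. For POMDP algorithm designers, the survey provides new insights into the unique challenges of applying POMDPs to robot systems and points to promising new directions for further research.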

Exploration in Deep Reinforcement Learning: A Comprehensive Survey

Tianpei Yang , Hongyao Tang , Chenjia Bai , Jinyi Liu , Jianye Hao , Zhaopeng Meng , Peng Liu , Zhen Wang

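Categories: Artificial Intelligence | Machine Learning

2021-09-14

Deep reinforcement learning (DRL) and deep multi-agent reinforcement learning (MARL) have achieved significant success across a wide range of domains, including game AI, autonomous vehicles, and robotics. However, DRL and deep MARL agents are widely known to be sample-inefficient: millions of interactions are usually needed even for relatively simple problem settings, which prevents wide application and deployment in real-world scenarios. One bottleneck challenge behind this is the well-known exploration problem, i.e., how to efficiently explore the environment and collect informative experiences that benefit policy learning towards the optimal policy. This problem becomes more challenging in complex environments with sparse rewards, noisy distractions, long horizons, and non-stationary co-learners. In this paper, we conduct a comprehensive survey of existing exploration methods for both single-agent and multi-agent RL. We start the survey by identifying several key challenges to efficient exploration. Beyond the two main branches of methods, we also include other notable exploration methods with different ideas and techniques. In addition to algorithmic analysis, we provide a comprehensive and unified empirical comparison of DRL exploration methods on a set of commonly used benchmarks. Based on our algorithmic and empirical investigation, we finally summarize the open problems of exploration in DRL and deep MARL and point out a few future directions.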

Design and Planning of Flexible Mobile Micro-Grids Using Deep Reinforcement Learning

Cesare Caputo , Michel-Alexandre Cardin , Pudong Ge , Fei Teng , Anna Korre , Ehecatl Antonio del Rio Chanona

Categories: Artificial Intelligence

2022-12-08

Ongoing risks from climate change have impacted the livelihood of global nomadic communities, and are likely to lead to increased migratory movements in coming years. As a result, mobility considerations are becoming increasingly important in energy systems planning, particularly to achieve energy access in developing countries. Advanced Plug and Play control strategies have been recently developed with such a decentralized framework in mind, more easily allowing for the interconnection of nomadic communities, both to each other and to the main grid. In light of the above, the design and planning strategy of a mobile multi-energy supply system for a nomadic community is investigated in this work. Motivated by the scale and dimensionality of the associated uncertainties, impacting all major design and decision variables over the 30-year planning horizon, Deep Reinforcement Learning (DRL) is implemented for the design and planning problem tackled. DRL based solutions are benchmarked against several rigid baseline design options to compare expected performance under uncertainty. The results on a case study for ger communities in Mongolia suggest that mobile nomadic energy systems can be both technically and economically feasible, particularly when considering flexibility, although the degree of spatial dispersion among households is an important limiting factor. Key economic, sustainability and resilience indicators such as Cost, Equivalent Emissions and Total Unmet Load are measured, suggesting potential improvements compared to available baselines of up to 25%, 67% and 76%, respectively. Finally, the decomposition of values of flexibility and plug and play operation is presented using a variation of real options theory, with important implications for both nomadic communities and policymakers focused on enabling their energy access.

Combining information-seeking exploration and reward maximization: Unified inference on continuous state and action spaces under partial observability

Parvin Malekzadeh , Konstantinos N. Plataniotis

Categories: Machine Learning | Artificial Intelligence

2022-12-15

Reinforcement learning (RL) gained considerable attention by creating decision-making agents that maximize rewards received from fully observable environments. However, many real-world problems are partially or noisily observable by nature, where agents do not receive the true and complete state of the environment. Such problems are formulated as partially observable Markov decision processes (POMDPs). Some studies applied RL to POMDPs by recalling previous decisions and observations or inferring the true state of the environment from received observations. Nevertheless, aggregating observations and decisions over time is impractical for environments with high-dimensional continuous state and action spaces. Moreover, so-called inference-based RL approaches require large number of samples to perform well since agents eschew uncertainty in the inferred state for the decision-making. Active inference is a framework that is naturally formulated in POMDPs and directs agents to select decisions by minimising expected free energy (EFE). This supplies reward-maximising (exploitative) behaviour in RL, with an information-seeking (exploratory) behaviour. Despite this exploratory behaviour of active inference, its usage is limited to discrete state and action spaces due to the computational difficulty of the EFE. We propose a unified principle for joint information-seeking and reward maximization that clarifies a theoretical connection between active inference and RL, unifies active inference and RL, and overcomes their aforementioned limitations. Our findings are supported by strong theoretical analysis. The proposed framework's superior exploration property is also validated by experimental results on partial observable tasks with high-dimensional continuous state and action spaces. Moreover, the results show that our model solves reward-free problems, making task reward design optional.

Macro-Action-Based Multi-Agent/Robot Deep Reinforcement Learning under Partial Observability

Yuchen Xiao

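Categories: Artificial Intelligence | Robotics

2022-09-20

State-of-the-art multi-agent reinforcement learning (MARL) methods have provided promising solutions to a variety of complex problems. Yet these methods all assume that agents perform synchronized primitive-action executions, so they are not genuinely scalable to long-horizon real-world multi-agent/robot tasks, which inherently require agents/robots to reason asynchronously about high-level action selection at varying time durations. The Macro-Action Decentralized Partially Observable Markov Decision Process (MacDec-POMDP) is a general formalization for asynchronous decision-making under uncertainty in fully cooperative multi-agent tasks. In this thesis, we first propose a group of value-based RL approaches for MacDec-POMDPs, in which agents are allowed to perform asynchronous learning and decision-making with macro-action-value functions in three paradigms: decentralized learning and control, centralized learning and control, and centralized training for decentralized execution (CTDE). Building on this work, we formulate a set of macro-action-based policy gradient algorithms under the three training paradigms, in which agents are allowed to directly optimize their parameterized policies in an asynchronous manner. We evaluate our methods both in simulation and on real robots. Empirical results demonstrate the superiority of our approaches in large multi-agent problems and validate the effectiveness of our algorithms for learning high-quality, asynchronous solutions with macro-actions.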

Monte Carlo Tree Search: A Review of Recent Modifications and Applications

Maciej Świechowski , Konrad Godlewski , Bartosz Sawicki , Jacek Mańdziuk

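Categories: Artificial Intelligence | Machine Learning

2021-03-08

Monte Carlo Tree Search (MCTS) is a powerful approach to designing game-playing bots or solving sequential decision problems. The method relies on intelligent tree search that balances exploration and exploitation. MCTS performs random sampling in the form of simulations and stores statistics of actions to make more educated choices in each subsequent iteration. The method has become the state of the art for combinatorial games. However, in more complex games (e.g., those with a high branching factor or real-time ones), as well as in various practical domains (e.g., transportation, scheduling, or security), an efficient MCTS application often requires problem-dependent modification or integration with other techniques. Such domain-specific modifications and hybrid approaches are the main focus of this survey. The last major MCTS survey was published in 2012. Contributions that have appeared since its release are of particular interest for this review.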
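
To illustrate how stored action statistics can balance exploration and exploitation, the sketch below implements the UCB1 selection rule, which is widely used in the UCT variant of MCTS. The abstract does not name a specific rule, so treating UCB1 as the selection step is an assumption here, and the function name `ucb1_select` and the sample statistics are hypothetical.

```python
import math


def ucb1_select(children, exploration=math.sqrt(2)):
    """Return the index of the child action to visit next.

    children is a list of (visits, total_reward) pairs, one per action.
    """
    parent_visits = sum(visits for visits, _ in children)
    best_index, best_score = -1, float("-inf")
    for index, (visits, total_reward) in enumerate(children):
        if visits == 0:
            return index  # try every action once before comparing scores
        mean_reward = total_reward / visits  # exploitation term
        # Exploration term: large for actions that were visited rarely.
        bonus = exploration * math.sqrt(math.log(parent_visits) / visits)
        score = mean_reward + bonus
        if score > best_score:
            best_index, best_score = index, score
    return best_index


# Hypothetical statistics: action 0 has the higher mean, action 1 is under-explored.
stats = [(10, 6.0), (2, 1.0)]
print(ucb1_select(stats))                   # default constant favours the rarely visited action
print(ucb1_select(stats, exploration=0.0))  # pure exploitation picks the higher mean
```

The `exploration` constant controls the trade-off: larger values spend more simulations on rarely visited actions, and zero reduces the rule to greedy selection on the mean reward.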

Generalized Reinforcement Learning: Experience Particles, Action Operator, Reinforcement Field, Memory Association, and Decision Concepts

Po-Hsiang Chiu , Manfred Huber

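Categories: Machine Learning | Artificial Intelligence

2022-08-09

Learning a control policy that involves time-varying and evolving system dynamics often poses a great challenge to mainstream reinforcement learning algorithms. In most standard methods, actions are assumed to be a rigid, fixed set of choices that are sequentially applied to the state space in a predefined manner. Consequently, without resorting to substantial re-learning, the learned policy lacks the ability to adapt to variations in the action set and in the "behavioral" outcomes of actions. In addition, the standard action representation and the action-induced state transition mechanism inherently limit how reinforcement learning can be applied in complex, real-world applications, primarily because of the intractability of the resulting large state space and the lack of a facility to generalize the learned policy to unknown parts of the state space. This paper proposes a Bayesian-flavored generalized reinforcement learning framework. It first establishes the notion of a parametric action model to better cope with uncertainty and fluid action behaviors. It then introduces the notion of a reinforcement field as a physics-inspired construct established through "polarized experience particles" maintained in the learning agent's working memory. These particles effectively encode the dynamic learning experience, which evolves over time in a self-organizing way. On top of the reinforcement field, we further generalize the policy learning process to incorporate high-level decision concepts by treating the past memory as having an implicit graph structure, in which past memory instances (or particles) are interconnected through a defined similarity between decisions, so that the "associative memory" principle can be applied to augment the learning agent's world model.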

Deep reinforcement learning: A brief survey

Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.

Active Inference Tree Search in Large POMDPs

Domenico Maisto , Francesco Gregoretti , Karl Friston , Giovanni Pezzulo

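Categories: Artificial Intelligence

2021-03-25

The ability to plan ahead efficiently is key for both living organisms and artificial systems. Model-based planning and prospection are widely studied in cognitive neuroscience and artificial intelligence (AI), but from different perspectives and with different desiderata in mind (biological realism versus scalability) that are difficult to reconcile. Here, we introduce a novel method for planning in large POMDPs, Active Tree Search (AcT), which combines the normative character and biological realism of a leading planning theory in neuroscience (active inference) with the scalability of tree search methods in AI. This unification benefits both approaches. On the one hand, tree search enables the biologically grounded, first-principles method of active inference to be applied to large-scale problems. On the other hand, active inference provides a principled solution to the exploration-exploitation dilemma, which is often addressed heuristically in tree search methods. Our simulations show that AcT successfully navigates binary trees that are challenging for sampling-based methods, problems that require adaptive exploration, and the large POMDP problem "RockSample", in which AcT approximates state-of-the-art POMDP solutions. Furthermore, we illustrate how AcT can be used to simulate the neurophysiological responses (e.g., in the hippocampus and prefrontal cortex) of humans and other animals that solve large planning problems. These numerical analyses show that Active Tree Search is a principled realization of neuroscientific and AI planning theories that offers both biological realism and scalability.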

Automated Reinforcement Learning: An Overview

Reza Refaei Afshar , Yingqian Zhang , Joaquin Vanschoren , Uzay Kaymak

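Categories: Machine Learning | Artificial Intelligence

2022-01-13

Reinforcement learning and, more recently, deep reinforcement learning are popular methods for solving sequential decision-making problems modeled as Markov decision processes. Modeling a problem for RL and selecting algorithms and hyperparameters require careful consideration, because different configurations may lead to completely different performance. These considerations are mainly the task of RL experts; however, RL is progressively becoming popular in other fields where researchers and system designers are not RL experts. In addition, many modeling decisions, such as defining the state and action spaces, the batch size and the frequency of batch updates, and the number of timesteps, are typically made manually. For these reasons, automating the different components of the RL framework is of great importance and has attracted much attention in recent years. Automated RL provides a framework in which the different components of RL, including MDP modeling, algorithm selection, and hyperparameter optimization, are modeled and defined automatically. In this article, we explore the literature and present recent work that can be used in automated RL. Moreover, we discuss the challenges, open questions, and research directions in AutoRL.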

A Maintenance Planning Framework using Online and Offline Deep Reinforcement Learning

Zaharah A. Bukhsh , Nils Jansen , Hajo Molegraaf

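Categories: Machine Learning | Artificial Intelligence

2022-08-01

Cost-effective asset management is an area of interest across several industries. Specifically, this paper develops a deep reinforcement learning (DRL) solution to automatically determine an optimal rehabilitation policy for continuously deteriorating water pipes. We approach the problem of rehabilitation planning in both online and offline DRL settings. In online DRL, the agent interacts with a simulated environment of multiple pipes with distinct lengths, materials, and failure rate characteristics. We train the agent using deep Q-learning (DQN) to learn an optimal policy with minimal average cost and reduced failure probability. In offline learning, the agent uses static data, e.g., DQN replay data, to learn an optimal policy via a conservative Q-learning algorithm without further interaction with the environment. We demonstrate that DRL-based policies improve over standard preventive, corrective, and greedy planning alternatives. Additionally, learning from the fixed DQN replay dataset surpasses the online DQN setting. The results indicate that existing deterioration profiles of water pipes, consisting of large state and action trajectories, provide a valuable avenue for learning rehabilitation policies in the offline setting without the need for a simulator.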

A Survey on Large-Population Systems and Scalable Multi-Agent Reinforcement Learning

Kai Cui , Anam Tahir , Gizem Ekinci , Ahmed Elshamanhory , Yannick Eich , Mengguang Li , Heinz Koeppl

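Categories: Artificial Intelligence | Machine Learning

2022-09-08

The analysis and control of large-population systems is of great interest to diverse areas of research and engineering, ranging from epidemiology and robotic swarms to economics and finance. An increasingly popular and effective approach to realizing sequential decision-making in multi-agent systems is multi-agent reinforcement learning, as it allows for an automatic and model-free analysis of highly complex systems. However, the key issue of scalability complicates the design of control and reinforcement learning algorithms, particularly in systems with large populations of agents. While reinforcement learning has found empirical success in many scenarios, problems with many agents quickly become intractable and require special consideration. In this survey, we shed light on current approaches to tractably understanding and analyzing large-population systems, both through multi-agent reinforcement learning and through adjacent areas of research such as mean-field games, collective intelligence, and complex network theory. These classically independent subject areas offer a variety of approaches to understanding or modeling large-population systems, which may be of great use for the formulation of tractable MARL algorithms in the future. Finally, we survey potential areas of application for large-scale control and identify fruitful future applications of learning algorithms in practical systems. We hope that our survey can provide insight and future directions to junior and senior researchers in both theoretical and applied sciences.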

Deep Reinforcement Learning for Autonomous Driving: A Survey

B Ravi Kiran , Ibrahim Sobh , Victor Talpaert , Patrick Mannion , Ahmad A. Al Sallab , Senthil Yogamani , Patrick Pérez

2020-02-02

With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework now capable of learning complex policies in high dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computational challenges in real world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning, inverse reinforcement learning that are related but are not classical RL algorithms. The role of simulators in training agents, methods to validate, test and robustify existing solutions in RL are discussed.

Active Inference in Robotics and Artificial Agents: Survey and Challenges

Pablo Lanillos , Cristian Meo , Corrado Pezzato , Ajith Anil Meera , Mohamed Baioumy , Wataru Ohata , Alexander Tschantz , Beren Millidge , Martijn Wisse , Christopher L. Buckley

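Categories: Robotics | Artificial Intelligence | Machine Learning

2021-12-03

Active inference is a mathematical framework that originated in computational neuroscience as a theory of how the brain implements action, perception, and learning. Recently, it has been shown to be a promising approach to the problems of state estimation and control under uncertainty, as well as a foundation for the construction of goal-driven behaviors in robotics and artificial agents in general. Here, we review the state-of-the-art theory and implementations of active inference for state estimation, control, planning, and learning, describing current achievements with a particular focus on robotics. We showcase relevant experiments that illustrate its potential in terms of adaptation, generalization, and robustness. Furthermore, we connect this approach with other frameworks and discuss its expected benefits and challenges: a unified framework with functional biological plausibility using variational Bayesian inference.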

Reinforcement Learning: A Survey

L. P. Kaelbling , M. L. Littman , A. W. Moore

1996-05-01

This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
