Balancing multiple competing and conflicting objectives is an essential task for any artificial intelligence tasked with satisfying human values or preferences. Conflict arises both from misalignment between individuals holding competing values and from misalignment between the conflicting value systems held by a single person. Starting from the principle of loss aversion, we designed a set of soft maximin functions for multi-objective decision-making. Benchmarking these functions in a set of previously developed environments, we found that one new approach in particular, "split-function exp-log loss aversion" (SFELLA), learns faster than the state-of-the-art thresholded alignment objective method \cite{vamplew_potential-based_2021} on three of the four tasks it was tested on, and reaches the same optimal performance after learning. SFELLA also showed improved relative robustness against changes in objective scale, which may highlight an advantage in settings involving distribution shift in the environment dynamics. Further work had to be omitted from this preprint, but in the final published version we will further compare SFELLA with the multi-objective reward exponentials (MORE) approach, showing that SFELLA performs similarly to MORE in a simple, previously described foraging task, but that in a modified foraging environment with a new resource that is not depleted as the agent works, SFELLA collects more of the new resource at almost no cost in terms of the old resource. Overall, we find SFELLA useful for avoiding problems that sometimes arise with thresholded approaches, and more reward-responsive than MORE while retaining its conservative, loss-averse incentive structure.
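The abstract does not state the SFELLA transform itself, so the snippet below only illustrates the general soft-maximin idea it builds on: aggregating per-objective rewards with a log-sum-exp soft minimum so that the worst-performing objective dominates the aggregate. The function name, the `temperature` parameter and the exact formula are illustrative assumptions, not the paper's method.

```python
import numpy as np

def soft_maximin(objective_rewards, temperature=1.0):
    """Log-sum-exp soft minimum over per-objective rewards.

    As temperature -> 0 this approaches min(objective_rewards); larger values
    blend the objectives more evenly. Illustrative only; not the SFELLA formula.
    """
    r = np.asarray(objective_rewards, dtype=float)
    z = -r / temperature
    m = z.max()
    # -T * log(mean(exp(-r / T))), computed in a numerically stable way
    return float(-temperature * (m + np.log(np.mean(np.exp(z - m)))))

# One badly performing objective dominates the aggregate score (about -2.45 here).
print(soft_maximin([5.0, 4.0, -3.0], temperature=0.5))
```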
Inferring reward functions from demonstrations and pairwise preferences are promising approaches for aligning reinforcement learning (RL) agents with human intentions. However, state-of-the-art methods typically focus on learning a single reward model, making it difficult to trade off different reward functions obtained from multiple experts. We propose Multi-Objective Reinforced Active Learning (MORAL), a novel method for combining diverse demonstrations of social norms into a Pareto-optimal policy. By maintaining a distribution over scalarization weights, our approach is able to interactively tune a deep RL agent towards a variety of preferences while eliminating the need to compute multiple policies. We empirically demonstrate the effectiveness of MORAL in two scenarios, modelling a delivery task and an emergency task that require the agent to act in the presence of normative conflicts. Overall, we consider our research a step towards multi-objective RL with learned rewards, bridging the gap between the current reward learning and machine ethics literatures.
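As a rough illustration of the core mechanism described above, maintaining a distribution over scalarization weights and scalarizing a vector-valued reward, here is a minimal sketch. The Dirichlet stand-in for MORAL's weight distribution, the three-objective setup and the function names are assumptions; the paper's actual posterior and its update from preference queries are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def scalarize(reward_vector, weights):
    """Linear scalarization of a vector-valued reward."""
    return float(np.dot(weights, reward_vector))

# Stand-in for MORAL's maintained distribution over scalarization weights:
# a Dirichlet whose concentration would be updated from preference feedback.
concentration = np.ones(3)              # three learned reward models (assumed)
weights = rng.dirichlet(concentration)  # one sampled trade-off between them

vector_reward = np.array([0.2, -0.5, 1.0])  # one reward signal per expert model
print(weights, scalarize(vector_reward, weights))
```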
We analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer, a situation we refer to as mesa-optimization, a neologism we introduce in this paper. We believe the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances will learned models be optimizers, including when they should not be? Second, when a learned model is an optimizer, what will its objective be, how will it differ from the loss function it was trained under, and how can it be aligned? In this paper, we provide an in-depth analysis of these two primary questions and give an overview of topics for future research.
If future AI systems are to be reliably safe in novel situations, they will need to incorporate general principles that guide them to robustly recognize which outcomes and behaviors would be harmful. Such principles may need to be supported by a binding regulatory regime, which in turn requires broadly accepted underlying principles. They should also be specific enough for technical implementation. Drawing inspiration from law, this article explains how negative human rights could fulfil the role of such principles and serve as a foundation both for an international regulatory regime and for building technical safety constraints for future AI systems.
Adequately assigning credit to actions for future outcomes based on their contributions is a long-standing open challenge in Reinforcement Learning. The assumptions of the most commonly used credit assignment method are disadvantageous in tasks where the effects of decisions are not immediately evident. Furthermore, this method can only evaluate actions that have been selected by the agent, making it highly inefficient. Still, no alternative methods have been widely adopted in the field. Hindsight Credit Assignment is a promising, but still unexplored candidate, which aims to solve the problems of both long-term and counterfactual credit assignment. In this thesis, we empirically investigate Hindsight Credit Assignment to identify its main benefits, and key points to improve. Then, we apply it to factored state representations, and in particular to state representations based on the causal structure of the environment. In this setting, we propose a variant of Hindsight Credit Assignment that effectively exploits a given causal structure. We show that our modification greatly decreases the workload of Hindsight Credit Assignment, making it more efficient and enabling it to outperform the baseline credit assignment method on various tasks. This opens the way to other methods based on given or learned causal structures.
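For readers unfamiliar with Hindsight Credit Assignment, the following is a minimal sketch of the return-conditional HCA estimator from the original HCA line of work that the thesis builds on: an action is credited in proportion to how likely it looks in hindsight given the observed return, relative to its prior policy probability. The array-based interface and the toy numbers are assumptions, and the thesis's factored, causal-structure variant is not shown.

```python
import numpy as np

def hca_return_estimate(returns, hindsight_probs, policy_prob):
    """Return-conditional hindsight credit estimate for one (state, action) pair.

    returns         : sampled episode returns Z from the state under the policy
    hindsight_probs : learned h(a | state, Z) evaluated at each sampled return
    policy_prob     : pi(a | state) for the action being credited
    """
    returns = np.asarray(returns, dtype=float)
    hindsight_probs = np.asarray(hindsight_probs, dtype=float)
    # Average of Z * h(a | s, Z) / pi(a | s) estimates Q(s, a).
    return float(np.mean((hindsight_probs / policy_prob) * returns))

# Toy usage with made-up numbers: the action receives more credit for returns
# it was likely (in hindsight) to have caused.
print(hca_return_estimate(returns=[1.0, 0.0, 2.0],
                          hindsight_probs=[0.8, 0.3, 0.9],
                          policy_prob=0.5))
```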
We are currently unable to specify human goals and societal values in a way that reliably directs AI behavior. Law-making and legal interpretation form a computational engine that converts opaque human values into legible directives. "Law Informs Code" is the research agenda capturing complex computational legal processes, and embedding them in AI. Similar to how parties to a legal contract cannot foresee every potential contingency of their future relationship, and legislators cannot predict all the circumstances under which their proposed bills will be applied, we cannot ex ante specify rules that provably direct good AI behavior. Legal theory and practice have developed arrays of tools to address these specification problems. For instance, legal standards allow humans to develop shared understandings and adapt them to novel situations. In contrast to more prosaic uses of the law (e.g., as a deterrent of bad behavior through the threat of sanction), leveraged as an expression of how humans communicate their goals, and what society values, Law Informs Code. We describe how data generated by legal processes (methods of law-making, statutory interpretation, contract drafting, applications of legal standards, legal reasoning, etc.) can facilitate the robust specification of inherently vague human goals. This increases human-AI alignment and the local usefulness of AI. Toward society-AI alignment, we present a framework for understanding law as the applied philosophy of multi-agent alignment. Although law is partly a reflection of historically contingent political power - and thus not a perfect aggregation of citizen preferences - if properly parsed, its distillation offers the most legitimate computational comprehension of societal values available. If law eventually informs powerful AI, engaging in the deliberative political process to improve law takes on even more meaning.
Curiosity for machine agents has been a focus of lively research activity. The study of human and animal curiosity, particularly specific curiosity, has unearthed several properties that would offer important benefits for machine learners, but that have not yet been well-explored in machine intelligence. In this work, we conduct a comprehensive, multidisciplinary survey of the field of animal and machine curiosity. As a principal contribution of this work, we use this survey as a foundation to introduce and define what we consider to be five of the most important properties of specific curiosity: 1) directedness towards inostensible referents, 2) cessation when satisfied, 3) voluntary exposure, 4) transience, and 5) coherent long-term learning. As a second main contribution of this work, we show how these properties may be implemented together in a proof-of-concept reinforcement learning agent: we demonstrate how the properties manifest in the behaviour of this agent in a simple non-episodic grid-world environment that includes curiosity-inducing locations and induced targets of curiosity. As we would hope, our example of a computational specific curiosity agent exhibits short-term directed behaviour while updating long-term preferences to adaptively seek out curiosity-inducing situations. This work, therefore, presents a landmark synthesis and translation of specific curiosity to the domain of machine learning and reinforcement learning and provides a novel view into how specific curiosity operates and in the future might be integrated into the behaviour of goal-seeking, decision-making computational agents in complex environments.
Safe Reinforcement Learning can be defined as the process of learning policies that maximize the expectation of the return in problems in which it is important to ensure reasonable system performance and/or respect safety constraints during the learning and/or deployment processes. We categorize and analyze two approaches of Safe Reinforcement Learning. The first is based on the modification of the optimality criterion, the classic discounted finite/infinite horizon, with a safety factor. The second is based on the modification of the exploration process through the incorporation of external knowledge or the guidance of a risk metric. We use the proposed classification to survey the existing literature, as well as suggesting future directions for Safe Reinforcement Learning.
The reinforcement learning paradigm is a popular way to address problems that have only limited environmental feedback, rather than correctly labeled examples, as is common in other machine learning contexts. While significant progress has been made to improve learning in a single task, the idea of transfer learning has only recently been applied to reinforcement learning tasks. The core idea of transfer is that experience gained in learning to perform one task can help improve learning performance in a related, but different, task. In this article we present a framework that classifies transfer learning methods in terms of their capabilities and goals, and then use it to survey the existing literature, as well as to suggest future directions for transfer learning work.
Mobile notification systems play an important role in a wide range of applications, communicating with users and sending them alerts and reminders to inform them about news, events, or messages. In this paper, we formulate the near-real-time notification decision problem as a Markov decision process in which we optimize for multiple objectives in the reward. We propose an end-to-end offline reinforcement learning framework to optimize sequential notification decisions. We address the challenges of offline learning with a Double Q-network method based on Conservative Q-learning, which mitigates the distributional shift problem and Q-value overestimation. We describe our fully deployed system and demonstrate the performance and benefits of the proposed approach through both offline and online experiments.
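The conservative component mentioned above can be illustrated with the standard CQL-style regularizer that is added to the usual TD loss: it pushes down a log-sum-exp over all actions' Q-values while pushing up the Q-value of the logged action. This is a generic sketch; the batch layout, function name and the schematic total loss are assumptions rather than the paper's exact architecture.

```python
import numpy as np

def cql_penalty(q_values, data_actions):
    """Conservative Q-learning regularizer for a batch of states.

    q_values     : array [batch, n_actions] of Q(s, .) from the online network
    data_actions : array [batch] of actions actually taken in the logged data

    Penalizes large Q-values on actions outside the logged data by pushing
    logsumexp_a Q(s, a) down towards Q(s, a_data).
    """
    q_values = np.asarray(q_values, dtype=float)
    m = q_values.max(axis=1, keepdims=True)
    logsumexp = (m + np.log(np.exp(q_values - m).sum(axis=1, keepdims=True))).squeeze(1)
    q_data = q_values[np.arange(len(q_values)), data_actions]
    return float(np.mean(logsumexp - q_data))

# Schematic total loss: td_loss + alpha * cql_penalty(q_values, data_actions)
print(cql_penalty([[1.0, 3.0, 0.5]], [1]))
```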
Reinforcement-learning agents seek to maximize a reward signal through environmental interactions. As humans, our contribution to the learning process is through designing the reward function. Like programmers, we have a behavior in mind and have to translate it into a formal specification, namely rewards. In this work, we consider the reward-design problem in tasks formulated as reaching desirable states and avoiding undesirable states. To start, we propose a strict partial ordering of the policy space. We prefer policies that reach the good states faster and with higher probability while avoiding the bad states longer. Next, we propose an environment-independent tiered reward structure and show it is guaranteed to induce policies that are Pareto-optimal according to our preference relation. Finally, we empirically evaluate tiered reward functions on several environments and show they induce desired behavior and lead to fast learning.
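A minimal sketch of what such a tiered reward function can look like is given below. The three-tier shape (penalize bad states heavily, charge a step cost elsewhere, stop paying it at goal states) follows the description above, but the specific tier magnitudes and the set-based state representation are illustrative assumptions rather than the paper's derived conditions.

```python
def tiered_reward(state, good_states, bad_states,
                  r_good=0.0, r_neutral=-1.0, r_bad=-100.0):
    """Three-tier reward: avoid bad states, otherwise pay a step cost until a
    goal state is reached. Tier magnitudes here are illustrative; the paper
    derives gaps between tiers that guarantee Pareto-optimal behaviour."""
    if state in bad_states:
        return r_bad
    if state in good_states:
        return r_good
    return r_neutral

print(tiered_reward("s3", good_states={"goal"}, bad_states={"lava"}))  # -1.0
```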
In many risk-aware and multi-objective reinforcement learning settings, the utility of the user is derived from a single execution of a policy. In these settings, making decisions based on the average future returns is not suitable. For example, in a medical setting a patient may only have one opportunity to treat their illness. Making decisions using just the expected future returns -- known in reinforcement learning as the value -- cannot account for the potential range of adverse or positive outcomes a decision may have. Therefore, we should use the distribution over expected future returns differently to represent the critical information that the agent requires at decision time by taking both the future and accrued returns into consideration. In this paper, we propose two novel Monte Carlo tree search algorithms. Firstly, we present a Monte Carlo tree search algorithm that can compute policies for nonlinear utility functions (NLU-MCTS) by optimising the utility of the different possible returns attainable from individual policy executions, resulting in good policies for both risk-aware and multi-objective settings. Secondly, we propose a distributional Monte Carlo tree search algorithm (DMCTS) which extends NLU-MCTS. DMCTS computes an approximate posterior distribution over the utility of the returns, and utilises Thompson sampling during planning to compute policies in risk-aware and multi-objective settings. Both algorithms outperform the state-of-the-art in multi-objective reinforcement learning for the expected utility of the returns.
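To make the distinction concrete, the sketch below evaluates rollouts the way NLU-MCTS is described above: the nonlinear utility is applied to each full return (accrued plus simulated future return), and the resulting utilities are averaged, rather than applying the utility to an average return. The particular threshold-style utility function and the numbers are made-up assumptions.

```python
import numpy as np

def nonlinear_utility(total_return, threshold=10.0):
    """Example nonlinear (risk-aware) utility over a single episode's return:
    value accrues normally up to a threshold, then with sharply diminishing weight."""
    return float(min(total_return, threshold) + 0.1 * max(total_return - threshold, 0.0))

def rollout_utility(accrued_return, future_returns, utility=nonlinear_utility):
    """NLU-MCTS-style evaluation: apply the utility to accrued + simulated
    future return for each rollout, then average the utilities (not the returns)."""
    totals = accrued_return + np.asarray(future_returns, dtype=float)
    return float(np.mean([utility(t) for t in totals]))

print(rollout_utility(accrued_return=4.0, future_returns=[3.0, 8.0, 1.0]))
```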
We study fair multi-objective reinforcement learning in which an agent must learn a policy that simultaneously achieves high reward on multiple dimensions of a vector-valued reward. Motivated by the fair resource allocation literature, we model this as an expected welfare maximization problem, for some non-linear fair welfare function of the vector of long-term cumulative rewards. One canonical example of such a function is the Nash Social Welfare, or geometric mean, the log transform of which is also known as the Proportional Fairness objective. We show that even approximately optimal optimization of the expected Nash Social Welfare is computationally intractable even in the tabular case. Nevertheless, we provide a novel adaptation of Q-learning that combines non-linear scalarized learning updates and non-stationary action selection to learn effective policies for optimizing nonlinear welfare functions. We show that our algorithm is provably convergent, and we demonstrate experimentally that our approach outperforms techniques based on linear scalarization, mixtures of optimal linear scalarizations, or stationary action selection for the Nash Social Welfare Objective.
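For reference, the welfare function named above can be written down directly: the sketch below computes the Nash Social Welfare (geometric mean) of a vector of cumulative rewards, whose log transform is the Proportional Fairness objective. The small epsilon for numerical safety is an assumption, and none of the paper's Q-learning adaptation is reproduced here.

```python
import numpy as np

def nash_social_welfare(cumulative_rewards, eps=1e-8):
    """Geometric mean of per-objective cumulative rewards (assumed non-negative)."""
    r = np.asarray(cumulative_rewards, dtype=float) + eps
    return float(np.exp(np.mean(np.log(r))))

print(nash_social_welfare([4.0, 4.0, 4.0]))   # 4.0: balanced rewards
print(nash_social_welfare([10.0, 1.0, 1.0]))  # ~2.15: imbalance is penalized
```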
In humans, perceptual awareness facilitates the fast recognition and extraction of information from sensory input. This awareness largely depends on how the human agent interacts with its environment. In this work, we propose active neural generative coding, a computational framework for learning action-driven generative models without backpropagation of errors (backprop) in dynamic environments. Specifically, drawing inspiration from the cognitive theory of planning as inference, we develop an intelligent agent that operates even under sparse rewards. We demonstrate on several simple control problems that our framework is competitive with deep Q-learning. The strong performance of our agent offers promising evidence that backprop-free approaches to neural inference and learning can drive goal-directed behavior.
Capturing and simulating intelligent adaptive behaviours within spatially explicit individual-based models remains an ongoing challenge for researchers. While an ever-increasing amount of real-world behavioural data is being collected, few approaches exist that can quantify and formalise key individual behaviours and how they change over space and time. Consequently, commonly used agent decision-making frameworks, such as event-condition-action rules, often have to focus on only a narrow range of behaviours. We argue that such behavioural frameworks often do not reflect real-world scenarios and fail to capture how behaviours develop in response to stimuli. Interest in machine learning methods and their potential for simulating intelligent adaptive behaviour has grown in recent years. One method that is beginning to gain traction in this area is reinforcement learning (RL). This paper explores how RL can be applied to create emergent agent behaviours using a simple predator-prey agent-based model (ABM). Running a series of simulations, we demonstrate that agents trained with the novel Proximal Policy Optimisation (PPO) algorithm exhibit properties of real-world intelligent adaptive behaviour, such as hiding, evading and foraging.
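Since the agents above are trained with PPO, a minimal sketch of PPO's clipped surrogate objective is included below for context. It is the standard textbook form, not the authors' ABM-specific training loop, and the example inputs are arbitrary.

```python
import numpy as np

def ppo_clip_objective(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective from PPO (to be maximized): the probability
    ratio is clipped so a single update cannot move the policy too far."""
    ratio = np.exp(np.asarray(log_probs_new) - np.asarray(log_probs_old))
    adv = np.asarray(advantages, dtype=float)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return float(np.mean(np.minimum(ratio * adv, clipped * adv)))

print(ppo_clip_objective([-0.5, -1.2], [-0.7, -1.0], [1.0, -0.5]))
```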
Monte Carlo Tree Search (MCTS) is a recently proposed search method that combines the precision of tree search with the generality of random sampling. It has received considerable interest due to its spectacular success in the difficult problem of computer Go, but has also proved beneficial in a range of other domains. This paper is a survey of the literature to date, intended to provide a snapshot of the state of the art after the first five years of MCTS research. We outline the core algorithm's derivation, impart some structure on the many variations and enhancements that have been proposed, and summarise the results from the key game and non-game domains to which MCTS methods have been applied. A number of open research questions indicate that the field is ripe for future work.
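As a concrete anchor for the survey above, the snippet below sketches the UCB1-based child selection used in UCT, the most widely used MCTS variant. The dictionary node representation and the exploration constant sqrt(2) are illustrative assumptions; in practice the constant is tuned per domain.

```python
import math

def uct_select(children):
    """UCB1-based child selection as used in UCT. Each child is a dict with
    'visits' and 'total_value'; unvisited children are tried first."""
    parent_visits = sum(c["visits"] for c in children)
    c_uct = math.sqrt(2)  # exploration constant; problem-dependent in practice

    def score(c):
        if c["visits"] == 0:
            return float("inf")
        exploit = c["total_value"] / c["visits"]
        explore = c_uct * math.sqrt(math.log(parent_visits) / c["visits"])
        return exploit + explore

    return max(children, key=score)

print(uct_select([{"visits": 10, "total_value": 6.0},
                  {"visits": 2, "total_value": 1.5}]))
```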
We consider the learning dynamics of a single reinforcement learning optimal execution trading agent when it interacts with an event-driven agent-based financial market model. Trading takes place asynchronously through a matching engine in event time. The optimal execution agent is considered at different levels of initial order size and over differently sized state spaces. The resulting impact on the agent-based model and the market is assessed using a calibration approach that explores changes in the empirical stylised facts and price impact curves. Convergence, volume-trajectory and action-trace plots are used to visualise the learning dynamics. This demonstrates how an optimal execution agent learns optimal trading decisions within a simulated, reactive market framework, and how the simulated market's back-reaction is in turn changed by the introduction of strategic order splitting.
Standard deep reinforcement learning (DRL) aims to maximize expected reward, weighting collected experiences equally when formulating a policy. This differs from human decision-making, in which gains and losses are valued differently and outlying outcomes receive increased attention. It also fails to exploit opportunities to improve safety and/or performance by incorporating distributional context. Several approaches to distributional DRL have been investigated, one popular strategy being to evaluate the projected return distribution of possible actions. We propose a more direct approach that optimizes a risk-sensitive objective specified in terms of the cumulative distribution function (CDF) of the distribution of full-episode rewards. This approach allows outcomes to be weighted according to their relative quality, can be used for both continuous and discrete action spaces, and applies naturally in both constrained and unconstrained settings. We show how an asymptotically consistent estimate of the policy gradient can be computed by sampling for a broad class of risk-sensitive objectives, and subsequently incorporate variance reduction and regularization measures to facilitate effective learning. We then demonstrate that using moderately "pessimistic" risk profiles, which emphasize scenarios in which the agent performs poorly, leads to enhanced exploration and a continual focus on addressing deficiencies. We test the approach with different risk profiles in six OpenAI Safety Gym environments, comparing against state-of-the-art on-policy methods. Without cost constraints, we find that pessimistic risk profiles can be used to reduce cost while improving total reward accumulation. With cost constraints, they provide higher positive rewards than risk-neutral approaches at the prescribed allowable cost.
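The abstract does not give the estimator itself, so the sketch below only illustrates the flavour of a CDF-based, pessimistic weighting: episode returns are ranked, mapped to empirical CDF values, and re-weighted so that poorly performing episodes dominate the update. The specific weighting function and its `power` parameter are assumptions, not the paper's objective.

```python
import numpy as np

def pessimistic_weights(episode_returns, power=2.0):
    """Weights episodes by a decreasing function of their empirical CDF rank,
    so poorly performing episodes ('pessimistic' profile) dominate the update.
    The weighting function here is illustrative, not the paper's."""
    r = np.asarray(episode_returns, dtype=float)
    ranks = np.argsort(np.argsort(r))             # 0 = worst episode
    cdf = (ranks + 0.5) / len(r)                  # empirical CDF values
    w = (1.0 - cdf) ** power
    return w / w.sum()

returns = np.array([3.0, -1.0, 7.0, 0.5])
print(pessimistic_weights(returns))  # largest weight falls on the -1.0 episode
```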
We review the literature on models that attempt to explain human behaviour in social interactions described by normal-form games with monetary payoffs. We begin by covering social and moral preferences. We then focus on the growing body of research showing that people react to the language in which actions are described, especially when it activates moral concerns. Finally, we argue that behavioural economics is in the midst of a paradigm shift towards language-based preferences, which will require exploring new models and experimental setups.
Curriculum learning for reinforcement learning is an increasingly popular technique in which an agent is trained on a sequence of intermediate tasks, called a curriculum, to improve its performance and learning speed. This paper introduces a novel paradigm for curriculum generation based on progression and mapping functions. While a progression function specifies the complexity of the environment at any given time, a mapping function generates an environment of a given complexity. Different progression functions are introduced, including an autonomous online task progression based on the agent's performance. The benefits and broad applicability of our approach are shown by empirically comparing its performance against two state-of-the-art curriculum learning algorithms on six domains.
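A minimal sketch of the progression/mapping split described above: a progression function maps training time to a complexity value, and a mapping function turns that complexity into a concrete task. The linear schedule and the grid-size mapping are illustrative assumptions; the paper also introduces performance-based online progression functions.

```python
def linear_progression(step, total_steps):
    """Progression function: environment complexity in [0, 1] as a function of
    training time (a simple linear schedule for illustration)."""
    return min(step / total_steps, 1.0)

def mapping_function(complexity, min_size=4, max_size=20):
    """Mapping function: turn a complexity value into a concrete task, here the
    side length of a hypothetical grid-world (an illustrative choice)."""
    return int(round(min_size + complexity * (max_size - min_size)))

for step in (0, 2500, 5000):
    c = linear_progression(step, total_steps=5000)
    print(step, c, mapping_function(c))
```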