The content that a recommender system (RS) shows to a user influences them. Therefore, when choosing a recommender to deploy, one is implicitly also choosing to induce specific internal states in users. Even more, systems trained via long-horizon optimization will have direct incentives to manipulate users: in this work, we focus on the incentive to shift user preferences so they are easier to satisfy. We argue that - before deployment - system designers should: estimate the shifts a recommender would induce; evaluate whether such shifts would be undesirable; and perhaps even actively optimize to avoid problematic shifts. These steps involve two challenging ingredients: estimation requires anticipating how hypothetical algorithms would influence user preferences if deployed - we do this by using historical user interaction data to train a predictive user model which implicitly contains their preference dynamics; evaluation and optimization additionally require metrics to assess whether such influences are manipulative or otherwise unwanted - we use the notion of "safe shifts", which define a trust region within which behavior is considered safe: for instance, the natural way in which users would shift without interference from the system could be deemed "safe". In simulated experiments, we show that our learned preference dynamics model is effective in estimating user preferences and how they would respond to new recommenders. Additionally, we show that recommenders that optimize for staying within the trust region can avoid manipulative behaviors while still generating engagement.
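As an illustration of the estimate-evaluate-optimize pipeline above, here is a minimal toy sketch (assumed names, a toy linear dynamics stand-in, and arbitrary constants - not the paper's implementation) of penalizing a recommender objective whenever the shift predicted by a learned user model leaves a trust region around the natural, interference-free shift:

```python
import numpy as np

def predict_next_prefs(prefs, rec, drift=0.05):
    """Stand-in preference dynamics: prefs drift toward the shown item (None = no recommender)."""
    if rec is None:
        return prefs  # natural dynamics: assume preferences stay put without interference
    return prefs + drift * (rec - prefs)

def shift_penalty(prefs, rec, radius=0.02, weight=5.0):
    """Penalty for an induced shift that leaves the 'safe' trust region around the natural shift."""
    induced = predict_next_prefs(prefs, rec) - prefs
    natural = predict_next_prefs(prefs, None) - prefs
    violation = max(0.0, np.linalg.norm(induced - natural) - radius)
    return weight * violation

prefs = np.array([0.2, 0.8])          # toy user preference vector
rec = np.array([0.9, 0.1])            # candidate item embedding
engagement = float(prefs @ rec)       # toy engagement estimate
print(engagement - shift_penalty(prefs, rec))  # penalized objective the recommender would optimize
```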
Recommender systems designed to surface content aligned with time-varying preferences require a proper accounting of the feedback effects that recommendations have on human behavior and psychological condition. We argue that modeling the influence of recommendations on people's preferences must be grounded in psychologically plausible models. We contribute a methodology for developing grounded dynamic preference models. We demonstrate this method with models that capture three classic effects from the psychology literature: mere exposure, operant conditioning, and hedonic adaptation. We conduct simulation-based studies to show that the psychological models exhibit distinct behaviors that can inform system design. Our study has two direct implications for dynamic user modeling in recommender systems. First, the methodology we outline is broadly applicable for psychologically grounding dynamic preference models; it allows us to critique recent contributions based on their limited discussion of psychological foundations and their implausible predictions. Second, we discuss the implications of dynamic preference models for recommender system evaluation and design. In one example, we show that engagement and diversity metrics may fail to capture desirable recommender system performance.
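To illustrate what psychologically grounded dynamics can look like, here is a toy sketch (illustrative update rules with assumed rates, not the paper's calibrated models) of two of the cited effects: mere exposure raising affinity for repeatedly shown items, and hedonic adaptation pulling enjoyment of repeatedly consumed items back toward a set point.

```python
import numpy as np

def mere_exposure_update(affinity, shown, rate=0.1):
    """Toy mere-exposure effect: affinity for a shown item drifts upward with repeated exposure."""
    affinity = affinity.copy()
    affinity[shown] += rate * (1.0 - affinity[shown])
    return affinity

def hedonic_adaptation_update(enjoyment, consumed, set_point=0.5, rate=0.2):
    """Toy hedonic adaptation: enjoyment of a consumed item relaxes back toward a set point."""
    enjoyment = enjoyment.copy()
    enjoyment[consumed] += rate * (set_point - enjoyment[consumed])
    return enjoyment

affinity = np.full(3, 0.3)
enjoyment = np.full(3, 0.9)
for _ in range(5):                       # recommend item 0 repeatedly
    affinity = mere_exposure_update(affinity, shown=0)
    enjoyment = hedonic_adaptation_update(enjoyment, consumed=0)
print(affinity.round(2), enjoyment.round(2))  # item 0: affinity rises, enjoyment adapts downward
```

Already in this toy setting the two effects pull engagement metrics in opposite directions over time, which is the kind of distinct behavior the abstract argues should inform system design.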
Industrial recommender systems deal with extremely large action spaces - many millions of items to recommend. Moreover, they need to serve billions of users, each unique at any point in time, making for a complex user state space. Fortunately, huge quantities of logged implicit feedback (e.g., user clicks, dwell time) are available for learning. Learning from logged feedback is, however, subject to biases caused by only observing feedback on recommendations selected by previous versions of the recommender. In this work, we present a general recipe for addressing such biases in a production top-K recommender system at YouTube, built with a policy-gradient-based algorithm, i.e., REINFORCE. The contributions of this paper are: (1) scaling REINFORCE to a production recommender system with an action space on the order of millions; (2) applying off-policy correction to address data biases when learning from logged feedback collected from multiple behavior policies; (3) proposing a novel top-K off-policy correction to account for our policy recommending multiple items at a time; (4) showcasing the value of exploration. We demonstrate the efficacy of our approaches through a series of simulated experiments and multiple live experiments on YouTube.
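To make contribution (3) concrete: the top-K off-policy corrected policy gradient has, approximately (sketched here from the published derivation; treat exact constants as a sketch), the form

$$\sum_{(s_t, a_t) \sim \beta} \frac{\pi_\theta(a_t \mid s_t)}{\beta(a_t \mid s_t)}\, \lambda_K(s_t, a_t)\, R_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t), \qquad \lambda_K(s, a) = K\,\bigl(1 - \pi_\theta(a \mid s)\bigr)^{K-1},$$

where $\beta$ is the behavior policy that generated the logs, $\pi_\theta$ is the policy being trained, and $R_t$ is the observed long-term reward. The multiplier $\lambda_K$ down-weights the gradient on items the policy already recommends with high probability: once an item is very likely to appear somewhere in a slate of K, pushing its probability further adds little.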
To date, most research on recommender systems has focused on maintaining long-term user engagement and satisfaction by promoting relevant and personalized content. However, evaluating the quality and reliability of this content remains very challenging. In this paper, we propose FEBR (Expert-Based Recommendation Framework), an apprenticeship learning framework to assess the quality of recommended content on online platforms. The framework exploits the demonstrated trajectories of experts (assumed to be reliable) in a recommendation evaluation environment to recover an unknown utility function. This function is used to learn an optimal policy describing the expert's behavior, which is then used within the framework to provide high-quality, personalized recommendations. We evaluate the performance of our solution through a user-interest simulation environment (built with RecSim). We simulate interactions under the aforementioned expert policy for video recommendation and compare its efficiency with standard recommendation methods. The results show that our approach provides significant gains in content quality, as evaluated by experts and watched by users, while maintaining almost the same watch time as the baseline approaches.
This survey aims to provide a comprehensive overview of recent trends in modeling and simulation (M&S) of interactions between users and recommender systems, and of M&S applications for improving the performance of industrial recommendation engines. We start with the motivation behind developing frameworks that implement simulators, and their use for training and testing recommender systems of different types (including reinforcement learning ones). Furthermore, we provide a new, consistent classification of existing simulators based on their functionality, approbation, and industrial effectiveness, and summarize the simulators found in the research literature. Among other things, we discuss the building blocks of a simulator: methods for generating synthetic data (users, items, user-item responses), methods and datasets for assessing simulation quality (including methods for monitoring and/or closing the possible simulation-to-reality gap), and methods for aggregating experimental simulation results. Finally, the survey considers emerging topics and open problems in the field.
Understanding a decision-maker's priorities by observing their behavior is critical for transparency and accountability in decision processes such as healthcare. Although conventional approaches to policy learning almost invariably assume stationarity of behavior, this is hardly true in practice: medical practice is constantly evolving as clinical professionals refine their knowledge over time. For example, as the medical community's understanding of organ transplantation has developed over the years, a pertinent question is: how have actual organ allocation policies been evolving? To give an answer, we desire a policy learning method that provides interpretable representations of decision-making, in particular capturing an agent's non-stationary knowledge of the world, and that operates in an offline manner. First, we model the evolving behavior of decision-makers in terms of contextual bandits and formalize the problem of Inverse Contextual Bandits (ICB). Second, we propose two concrete algorithms as solutions, learning parametric and nonparametric representations of the agent's behavior. Finally, using both real and simulated data for liver transplantation, we illustrate the applicability and explainability of our methods, as well as benchmarking and validating their accuracy.
When robots interact with humans in homes, roads, or factories the human's behavior often changes in response to the robot. Non-stationary humans are challenging for robot learners: actions the robot has learned to coordinate with the original human may fail after the human adapts to the robot. In this paper we introduce an algorithmic formalism that enables robots (i.e., ego agents) to co-adapt alongside dynamic humans (i.e., other agents) using only the robot's low-level states, actions, and rewards. A core challenge is that humans not only react to the robot's behavior, but the way in which humans react inevitably changes both over time and between users. To deal with this challenge, our insight is that -- instead of building an exact model of the human -- robots can learn and reason over high-level representations of the human's policy and policy dynamics. Applying this insight we develop RILI: Robustly Influencing Latent Intent. RILI first embeds low-level robot observations into predictions of the human's latent strategy and strategy dynamics. Next, RILI harnesses these predictions to select actions that influence the adaptive human towards advantageous, high reward behaviors over repeated interactions. We demonstrate that -- given RILI's measured performance with users sampled from an underlying distribution -- we can probabilistically bound RILI's expected performance across new humans sampled from the same distribution. Our simulated experiments compare RILI to state-of-the-art representation and reinforcement learning baselines, and show that RILI better learns to coordinate with imperfect, noisy, and time-varying agents. Finally, we conduct two user studies where RILI co-adapts alongside actual humans in a game of tag and a tower-building task. See videos of our user studies here: https://youtu.be/WYGO5amDXbQ
Evaluating the real-world performance of network protocols is challenging. Randomized control trials (RCTs) are expensive and inaccessible to most researchers, while expert-designed simulators fail to capture complex behaviors in real networks. We present CausalSim, a data-driven simulator for network protocols that addresses this challenge. Learning network behavior from observational data is complicated by biases introduced by the protocols used during data collection. CausalSim uses traces from an initial RCT under a set of protocols to learn a causal network model, effectively removing the biases present in the data. Using this model, any protocol can then be simulated on the same traces (i.e., for counterfactual predictions). Key to CausalSim is a novel use of adversarial neural network training that exploits distributional invariances arising from the training data coming from an RCT. Our extensive evaluation of CausalSim on real and synthetic datasets and two use cases, including more than nine months of real data from the Puffer video streaming system, shows that it provides accurate counterfactual predictions, reducing prediction error by 44% and 53% on average compared to expert-designed and standard supervised learning baselines.
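A rough sketch of the kind of adversarial-invariance training described above (module names, shapes, and architecture are illustrative assumptions, not CausalSim's actual design): a latent factor extracted from a trace is used to predict outcomes, while a discriminator tries to guess which protocol the trace was collected under; because an RCT assigns protocols at random, a well-learned latent factor should carry no information about that assignment, which the adversarial term enforces.

```python
import torch
import torch.nn as nn

latent_dim, n_protocols = 8, 3
encoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, latent_dim))        # trace -> latent factor
predictor = nn.Sequential(nn.Linear(latent_dim + 1, 32), nn.ReLU(), nn.Linear(32, 1))   # (latent, action) -> outcome
discriminator = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, n_protocols))

opt_model = torch.optim.Adam(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-3)
opt_disc = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

def train_step(trace, action, outcome, protocol_id, adv_weight=1.0):
    # 1) discriminator tries to identify the data-collection protocol from the latent factor
    z = encoder(trace).detach()
    disc_loss = nn.functional.cross_entropy(discriminator(z), protocol_id)
    opt_disc.zero_grad(); disc_loss.backward(); opt_disc.step()

    # 2) model predicts the observed outcome while fooling the discriminator, pushing the
    #    latent factor to be invariant to the protocol (as RCT randomization implies it should be)
    z = encoder(trace)
    pred = predictor(torch.cat([z, action], dim=-1)).squeeze(-1)
    fool = -nn.functional.cross_entropy(discriminator(z), protocol_id)
    model_loss = nn.functional.mse_loss(pred, outcome) + adv_weight * fool
    opt_model.zero_grad(); model_loss.backward(); opt_model.step()
    return model_loss.item()

B = 32  # toy batch of random tensors just to show the call signature
loss = train_step(torch.randn(B, 16), torch.randn(B, 1), torch.randn(B),
                  torch.randint(0, n_protocols, (B,)))
```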
Adequately assigning credit to actions for future outcomes based on their contributions is a long-standing open challenge in Reinforcement Learning. The assumptions of the most commonly used credit assignment method are disadvantageous in tasks where the effects of decisions are not immediately evident. Furthermore, this method can only evaluate actions that have been selected by the agent, making it highly inefficient. Still, no alternative methods have been widely adopted in the field. Hindsight Credit Assignment is a promising, but still unexplored candidate, which aims to solve the problems of both long-term and counterfactual credit assignment. In this thesis, we empirically investigate Hindsight Credit Assignment to identify its main benefits, and key points to improve. Then, we apply it to factored state representations, and in particular to state representations based on the causal structure of the environment. In this setting, we propose a variant of Hindsight Credit Assignment that effectively exploits a given causal structure. We show that our modification greatly decreases the workload of Hindsight Credit Assignment, making it more efficient and enabling it to outperform the baseline credit assignment method on various tasks. This opens the way to other methods based on given or learned causal structures.
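For background on Hindsight Credit Assignment (Harutyunyan et al., 2019), which the thesis builds on: the return-conditioned form rests on an identity of roughly this shape (a sketch of the original formulation, not of the factored variant proposed here),

$$Q^\pi(x, a) \;=\; \mathbb{E}_{Z \mid x}\!\left[\frac{h(a \mid x, Z)}{\pi(a \mid x)}\, Z\right],$$

where $Z$ is the return obtained from state $x$ under $\pi$, and $h(a \mid x, z)$ is the learned hindsight distribution over the first action given that the return turned out to be $z$. Because $h$ can be evaluated for every action, not only the one the agent actually took, credit can be assigned counterfactually from the same trajectories - addressing the inefficiency noted above.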
Generative Flow Networks (GFlowNets) have been introduced as a method for sampling a diverse set of candidates in an active learning context, with a training objective that makes them approximately sample in proportion to a given reward function. In this paper, we show a number of additional theoretical properties of GFlowNets. They can be used to estimate joint probability distributions and the corresponding marginal distributions where some variables are unspecified and, of particular interest, can represent distributions over composite objects like sets and graphs. GFlowNets amortize the work typically done by computationally expensive MCMC methods in a single, but trained, generative pass. They can also be used to estimate partition functions and free energies, conditional probabilities of supersets (supergraphs) given a subset (subgraph), as well as marginal distributions over all supersets (supergraphs) of a given set (graph). We introduce variations enabling the estimation of entropy and mutual information, sampling from a Pareto frontier, connections to reward-maximizing policies, and extensions to stochastic environments, continuous actions, and modular energy functions.
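For context on the "sample in proportion to the reward" property: a GFlowNet learns edge flows $F(s \to s')$ over a directed acyclic construction graph subject to a flow-matching constraint (stated here in its standard generic form, not this paper's extended variants),

$$\sum_{s:\, s \to s'} F(s \to s') \;=\; \sum_{s'':\, s' \to s''} F(s' \to s'') \quad \text{for every interior state } s', \qquad F(x \to \text{terminal}) = R(x).$$

When the constraint is satisfied, sampling forward in proportion to outgoing edge flows draws terminal objects $x$ with probability $R(x)/Z$, where $Z$, the total flow leaving the initial state, equals the partition function $\sum_x R(x)$ - which is how the partition-function and free-energy estimates mentioned above become available as a by-product of training.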
We introduce Probabilistic Rank and Reward (PRR), a scalable probabilistic model for personalized slate recommendation. Our model allows state-of-the-art estimation of user interests in the ubiquitous recommendation scenario where a user is shown a slate of K recommendations and can choose at most one of these K items. The goal of the recommendation system is to find the K items of most interest to the user, in order to maximize the probability that the user interacts with the slate. Our contribution is to show that the probability of a recommendation being successful can be learned more effectively by combining the reward - whether the slate was clicked or not - with the rank - the item on the slate that was selected. Our method learns more efficiently than bandit methods that use the reward alone and than user-preference methods that use the rank alone. It also provides similar or better estimation performance than independent inverse-propensity-score methods, and is far more scalable. Our method is state of the art in terms of both speed and accuracy on massive datasets of up to one million items. Finally, our method allows fast delivery of recommendations powered by maximum inner product search (MIPS), making it suitable for extremely low-latency domains such as computational advertising.
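One way to see why combining the two signals helps - a generic slate-choice likelihood given for intuition only, not PRR's exact parameterization: model the user as picking at most one of the K items (or nothing),

$$P(\text{click } i \mid x_1, \dots, x_K) = \frac{e^{u(x_i)}}{e^{u_0} + \sum_{j=1}^{K} e^{u(x_j)}}, \qquad P(\text{no click} \mid x_1, \dots, x_K) = \frac{e^{u_0}}{e^{u_0} + \sum_{j=1}^{K} e^{u(x_j)}},$$

where $u(x)$ is the latent interest in item $x$ and $u_0$ absorbs the option of not interacting. A reward-only (bandit) likelihood merges the K click outcomes into a single event and discards which item earned the click; a rank-only likelihood conditions on a click having happened and discards the no-click impressions; using both retains all of the information in the logs, which is the intuition behind the efficiency gains claimed above.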
When inferring reward functions from human behavior (be it demonstrations, comparisons, physical corrections, or e-stops), it has proven useful to model the human as making noisy-rational choices, with a "rationality coefficient" capturing how much noise or entropy we expect to see in the human's behavior. Many existing works choose to fix this coefficient regardless of the type or quality of the human feedback. However, in some settings, giving a demonstration may be much more difficult than answering a comparison query. In this case, we should expect to see more noise or suboptimality in demonstrations than in comparisons, and should interpret the feedback accordingly. In this work, we advocate that grounding the rationality coefficient in real data for each feedback type, rather than assuming a default value, has a significant positive effect on reward learning. We test this in experiments with simulated feedback as well as in a user study. We find that overestimating human rationality can have dire effects on reward accuracy and regret when learning from a single feedback type. Further, we find that the rationality level affects the informativeness of each feedback type: surprisingly, demonstrations are not always the most informative - when the human acts very suboptimally, comparisons actually become more informative, even when the rationality level is the same for both. Moreover, when the robot gets to decide which feedback type to ask for, it gains a large advantage from accurately modeling the rationality level of each type. Ultimately, our results emphasize the importance of paying attention to the assumed rationality level, not only when learning from a single feedback type, but especially when agents learn from multiple feedback types.
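The noisy-rational model referenced here is typically Boltzmann-rational choice. A minimal sketch (assumed names and a synthetic example, not the paper's code) of grounding the rationality coefficient beta for comparison feedback by maximum likelihood; the same fit would be repeated separately per feedback type:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_likelihood(beta, chosen_r, rejected_r):
    """Boltzmann-rational comparisons: P(choose A over B) = sigmoid(beta * (R(A) - R(B)))."""
    logits = beta * (chosen_r - rejected_r)
    return np.sum(np.logaddexp(0.0, -logits))   # -sum log sigmoid(logits), numerically stable

def fit_rationality(chosen_r, rejected_r):
    """Ground beta in observed feedback of one type instead of assuming a default value."""
    res = minimize_scalar(neg_log_likelihood, bounds=(1e-3, 50.0), method="bounded",
                          args=(chosen_r, rejected_r))
    return res.x

rng = np.random.default_rng(0)
true_beta = 2.0
r_a, r_b = rng.normal(size=500), rng.normal(size=500)
prefers_a = rng.random(500) < 1 / (1 + np.exp(-true_beta * (r_a - r_b)))
chosen = np.where(prefers_a, r_a, r_b)
rejected = np.where(prefers_a, r_b, r_a)
print(fit_rationality(chosen, rejected))  # should recover roughly beta = 2.0
```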
Transformer, originally devised for natural language processing, has also achieved significant success in computer vision. Thanks to its super expressive power, researchers are investigating ways to deploy transformers to reinforcement learning (RL) and the transformer-based models have manifested their potential in representative RL benchmarks. In this paper, we collect and dissect recent advances on transforming RL by transformer (transformer-based RL or TRL), in order to explore its development trajectory and future trend. We group existing developments in two categories: architecture enhancement and trajectory optimization, and examine the main applications of TRL in robotic manipulation, text-based games, navigation and autonomous driving. For architecture enhancement, these methods consider how to apply the powerful transformer structure to RL problems under the traditional RL framework, which model agents and environments much more precisely than deep RL methods, but they are still limited by the inherent defects of traditional RL algorithms, such as bootstrapping and the "deadly triad". For trajectory optimization, these methods treat RL problems as sequence modeling and train a joint state-action model over entire trajectories under the behavior cloning framework, which are able to extract policies from static datasets and fully use the long-sequence modeling capability of the transformer. Given these advancements, extensions and challenges in TRL are reviewed and proposals about future directions are discussed. We hope that this survey can provide a detailed introduction to TRL and motivate future research in this rapidly developing field.
Humans have internal models of robots (like their physical capabilities), the world (like what will happen next), and their tasks (like a preferred goal). However, human internal models are not always perfect: for example, it is easy to underestimate a robot's inertia. Nevertheless, these models change and improve over time as humans gather more experience. Interestingly, robot actions influence what this experience is, and therefore influence how people's internal models change. In this work we take a step towards enabling robots to understand the influence they have, leverage it to better assist people, and help human models more quickly align with reality. Our key idea is to model the human's learning as a nonlinear dynamical system which evolves the human's internal model given new observations. We formulate a novel optimization problem to infer the human's learning dynamics from demonstrations that naturally exhibit human learning. We then formalize how robots can influence human learning by embedding the human's learning dynamics model into the robot planning problem. Although our formulations provide concrete problem statements, they are intractable to solve in full generality. We contribute an approximation that sacrifices the complexity of the human internal models we can represent, but enables robots to learn the nonlinear dynamics of these internal models. We evaluate our inference and planning methods in a suite of simulated environments and an in-person user study, where a 7DOF robotic arm teaches participants to be better teleoperators. While influencing human learning remains an open problem, our results demonstrate that this influence is possible and can be helpful in real human-robot interaction.
Recommender systems aim to answer the following question: given the items that a user has interacted with, what items will this user likely interact with next? Historically this problem is often framed as a predictive task via (self-)supervised learning. In recent years, we have seen more emphasis placed on approaching the recommendation problem from a policy optimization perspective: learning a policy that maximizes some reward function (e.g., user engagement). However, it is commonly the case in recommender systems that we are only able to train a new policy given data collected from a previously-deployed policy. The conventional way to address such a policy mismatch is through importance sampling correction, which unfortunately comes with its own limitations. In this paper, we suggest an alternative approach, which involves the use of local policy improvement without off-policy correction. Drawing from a number of related results in the fields of causal inference, bandits, and reinforcement learning, we present a suite of methods that compute and optimize a lower bound of the expected reward of the target policy. Crucially, this lower bound is a function that is easy to estimate from data, and which does not involve density ratios (such as those appearing in importance sampling correction). We argue that this local policy improvement paradigm is particularly well suited for recommender systems, given that in practice the previously-deployed policy is typically of reasonably high quality, and furthermore it tends to be re-trained frequently and gets continuously updated. We discuss some practical recipes on how to apply some of the proposed techniques in a sequential recommendation setting.
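One concrete instance of such a bound - sketched here as a generic derivation consistent with the description, not necessarily the exact objective the paper optimizes: for non-negative rewards, applying $\log w \le w - 1$ to the importance weight $w = \pi(a \mid s)/\pi_0(a \mid s)$ gives

$$J(\pi) = \mathbb{E}_{(s,a) \sim \pi_0}\!\left[\frac{\pi(a \mid s)}{\pi_0(a \mid s)}\, r(s, a)\right] \;\ge\; \mathbb{E}_{(s,a) \sim \pi_0}\!\Bigl[r(s,a)\bigl(1 + \log \pi(a \mid s) - \log \pi_0(a \mid s)\bigr)\Bigr].$$

The terms that do not involve $\pi$ are constants, so maximizing the bound reduces to the reward-weighted log-likelihood $\mathbb{E}_{\pi_0}[\,r(s,a)\log \pi(a \mid s)\,]$, which can be estimated directly from logged data with no density ratios. The bound is tight at $\pi = \pi_0$, matching the local-improvement intuition that the new policy should stay close to an already reasonable logging policy.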
While reinforcement learning (RL) has become a more popular approach for robotics, designing sufficiently informative reward functions for complex tasks has proven to be extremely difficult due to their inability to capture human intent and policy exploitation. Preference based RL algorithms seek to overcome these challenges by directly learning reward functions from human feedback. Unfortunately, prior work either requires an unreasonable number of queries implausible for any human to answer or overly restricts the class of reward functions to guarantee the elicitation of the most informative queries, resulting in models that are insufficiently expressive for realistic robotics tasks. Contrary to most works that focus on query selection to \emph{minimize} the amount of data required for learning reward functions, we take an opposite approach: \emph{expanding} the pool of available data by viewing human-in-the-loop RL through the more flexible lens of multi-task learning. Motivated by the success of meta-learning, we pre-train preference models on prior task data and quickly adapt them for new tasks using only a handful of queries. Empirically, we reduce the amount of online feedback needed to train manipulation policies in Meta-World by 20$\times$, and demonstrate the effectiveness of our method on a real Franka Panda Robot. Moreover, this reduction in query-complexity allows us to train robot policies from actual human users. Videos of our results and code can be found at https://sites.google.com/view/few-shot-preference-rl/home.
Active inference is a mathematical framework that originated in computational neuroscience as a theory of how the brain implements action, perception, and learning. Recently, it has been shown to be a promising approach to problems of state estimation and control under uncertainty, as well as a foundation for goal-driven behavior in robotics and artificial agents in general. Here, we review the state-of-the-art theory and implementations of active inference for state estimation, control, planning, and learning, describing current achievements with a particular focus on robotics. We showcase relevant experiments that illustrate its potential in terms of adaptation, generalization, and robustness. Furthermore, we connect this approach with other frameworks and discuss its expected benefits and challenges: a unified framework with functional biological plausibility using variational Bayesian inference.
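For readers new to the framework, the central quantity is the variational free energy; a standard generic statement (not specific to the robotics implementations reviewed here) is

$$F = \mathbb{E}_{q(s)}\bigl[\ln q(s) - \ln p(o, s)\bigr] = D_{\mathrm{KL}}\bigl[q(s) \,\|\, p(s \mid o)\bigr] - \ln p(o) \;\ge\; -\ln p(o),$$

so minimizing $F$ with respect to the approximate posterior $q(s)$ implements perception as approximate Bayesian inference, while in active inference actions are chosen to minimize the expected free energy of anticipated observations, which is what yields the unified account of state estimation, control, and goal-directed behavior mentioned above.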
Learning control policies in the presence of time-varying and evolving system dynamics often poses a great challenge to mainstream reinforcement learning algorithms. In most standard methods, actions are typically assumed to be a rigid, fixed set of choices that are sequentially applied to the state space in a predefined manner. Consequently, without resorting to substantial re-learning, the learned policy lacks the ability to adapt to variations in the action set and in the "behavioral" outcomes of actions. In addition, the standard action representation and the action-induced state-transition mechanism inherently limit how reinforcement learning can be applied to complex real-world applications, primarily because of the intractability of the resulting large state space and the lack of means to generalize the learned policy to unknown parts of that space. This paper proposes a Bayesian-flavored generalized reinforcement learning framework that first establishes the notion of a parametric action model to better cope with uncertainty and fluid action behaviors, and then introduces the notion of a reinforcement field as a physics-inspired construct established through "polarized experience particles" maintained in the learning agent's working memory. These particles effectively encode the agent's dynamic learning experience, which evolves over time in a self-organizing way. On top of the reinforcement field, we further generalize the policy learning process to incorporate high-level decision concepts by treating past memory as having an implicit graph structure, in which past memory instances (or particles) are interconnected through a defined similarity between decisions, so that the "associative memory" principle can be applied to augment the learning agent's world model.
In this paper, we present an approach for predicting trust links between peers in social media, one that is grounded in the artificial intelligence area of multiagent trust modeling. In particular, we propose a data-driven, multi-faceted trust modeling which incorporates many distinct features for a comprehensive analysis. We focus on demonstrating how clustering of similar users enables a critical new functionality: supporting more personalized, and thus more accurate, predictions for users. Illustrated in a trust-aware item recommendation task, we evaluate the proposed framework in the context of a large Yelp dataset. We then discuss how improving the detection of trusted relationships in social media can help to address the misbehaviour of online users and the spread of rumours that have recently surged in social networking environments. We conclude with a reflection on a particularly vulnerable user base, older adults, to illustrate the value of reasoning about groups of users, and we look toward future directions for integrating known preferences with insights gained through data analysis.
In an era of countless content offerings, recommender systems alleviate information overload by providing users with personalized content suggestions. Due to the scarcity of explicit user feedback, modern recommender systems typically optimize for the same fixed combination of implicit feedback signals across all users. However, this approach disregards a growing body of work highlighting that (i) implicit signals can be used by users in diverse ways, signaling anything from satisfaction to active dislike, and (ii) different users communicate preferences in different ways. We propose applying the recent Interaction Grounded Learning (IGL) paradigm to address the challenge of learning representations of diverse user communication modalities. Rather than taking a fixed, human-designed reward function, IGL is able to learn personalized reward functions for different users and then optimize directly for the latent user satisfaction. We demonstrate the success of IGL with experiments using simulations as well as with real-world production traces.
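A structural sketch of the Interaction Grounded Learning loop described above (illustrative Python with assumed names, a toy linear reward decoder, and a non-contextual softmax policy; it shows only the data flow and omits how IGL actually grounds the decoder, which is the paradigm's key technical contribution): the learner never observes a reward, only raw implicit-feedback signals, and optimizes the policy against the reward it decodes from them.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Interaction:
    context: np.ndarray   # user / request features
    action: int           # recommended item
    feedback: np.ndarray  # raw implicit signals (click, dwell, share, ...) -- no explicit reward

def decode_reward(feedback, decoder_weights):
    """Personalized reward decoder psi(y): maps raw implicit feedback to inferred satisfaction."""
    return float(1 / (1 + np.exp(-feedback @ decoder_weights)))

def policy_update(policy_logits, interaction, decoder_weights, lr=0.1):
    """Reward-weighted update of a softmax policy using the decoded (not observed) reward."""
    probs = np.exp(policy_logits) / np.sum(np.exp(policy_logits))
    r_hat = decode_reward(interaction.feedback, decoder_weights)
    grad = -probs
    grad[interaction.action] += 1.0           # gradient of log pi(action)
    return policy_logits + lr * r_hat * grad

inter = Interaction(context=np.zeros(4), action=1,
                    feedback=np.array([1.0, 0.3, 0.0]))   # e.g. click, dwell, share
w = np.array([2.0, 1.0, 0.5])                              # assumed per-user decoder weights
print(policy_update(np.zeros(3), inter, w))
```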