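We study privacy-preserving exploration in sequential decision-making for environments that rely on sensitive data such as medical records. In particular, we focus on solving the problem of reinforcement learning (RL) subject to the constraint of (joint) differential privacy in the linear MDP setting, where both dynamics and rewards are given by linear functions. Prior work on this problem due to Luyo et al. (2021) achieves a regret rate with a dependence of $O(K^{3/5})$ on the number of episodes $K$. We provide a private algorithm with an improved regret rate that has the optimal dependence of $O(\sqrt{K})$ on the number of episodes. The key recipe for our stronger regret guarantee is adaptivity in the policy update schedule, in which an update occurs only when a sufficient change in the data is detected. As a result, our algorithm benefits from low switching cost and performs only $O(\log(K))$ updates, which greatly reduces the amount of privacy noise. Finally, in the most prevalent privacy regime, where the privacy parameter $\epsilon$ is a constant, our algorithm incurs negligible privacy cost: compared with existing non-private regret bounds, the additional regret due to privacy appears only in lower-order terms.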
translated by 谷歌翻译
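The adaptive update schedule described in the first abstract above can be illustrated with a deliberately simplified sketch. It assumes one-dimensional features, so the regularized Gram matrix reduces to a scalar, and it triggers an update only when that scalar has at least doubled since the previous update. The function name, the regularizer `lam`, and the threshold `ratio` are illustrative assumptions, not details taken from the paper.

```python
def count_policy_updates(features, lam=1.0, ratio=2.0):
    """Count policy updates under a 'sufficient change' trigger (1-D sketch)."""
    gram = lam             # scalar stand-in for the regularized Gram matrix
    gram_at_update = lam   # value of `gram` when the policy was last updated
    updates = 0
    for phi in features:   # one hypothetical scalar feature per episode
        gram += phi * phi
        # Update only when the accumulated information has grown enough.
        if gram >= ratio * gram_at_update:
            updates += 1
            gram_at_update = gram
    return updates


if __name__ == "__main__":
    # With unit features, the number of updates grows logarithmically
    # in the number of episodes.
    for k in (10, 100, 1000, 10000):
        print(k, count_policy_updates([1.0] * k))
```

In this toy picture, a private algorithm would only need to release noisy statistics at the update points, which is consistent with the abstract's remark that fewer updates reduce the amount of privacy noise.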
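We study regret minimization in finite-horizon tabular Markov decision processes (MDPs) under the constraints of differential privacy (DP). This is motivated by the widespread application of reinforcement learning (RL) to real-world sequential decision-making problems, where protecting users' sensitive and private information is becoming paramount. We consider two variants of DP: joint DP (JDP), where a centralized agent is responsible for protecting users' sensitive data, and local DP (LDP), where information needs to be protected directly on the user side. We first propose two general frameworks, one for policy optimization and another for value iteration, for designing private, optimistic RL algorithms. We then instantiate these frameworks with suitable privacy mechanisms to satisfy the JDP and LDP requirements while simultaneously obtaining sublinear regret guarantees. The regret bounds show that under JDP the cost of privacy is only a lower-order additive term, whereas under LDP, which offers stronger privacy protection, the cost is multiplicative. Finally, the regret bounds are obtained through a unified analysis, which, we believe, can be extended beyond tabular MDPs.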
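As a rough illustration of what a "privacy mechanism" for a released statistic can look like, the sketch below applies the standard Laplace mechanism to a single count. It is not the mechanism used in the paper above; the sensitivity of 1 and the function names are assumptions made purely for the example.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity / epsilon.

    Assumes changing one user's data alters the count by at most
    `sensitivity` (an assumption of this sketch).
    """
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    # Smaller epsilon (stronger privacy) means larger noise on average.
    for eps in (0.1, 1.0, 10.0):
        print(eps, private_count(100, eps, rng))
```

The noise scale is inversely proportional to $\epsilon$, which gives some intuition for why privacy costs in regret bounds typically grow as $\epsilon$ shrinks.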
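This paper studies privacy-preserving exploration in Markov decision processes (MDPs) with linear representation. We first consider the setting of linear-mixture MDPs (Ayoub et al., 2020) (a.k.a. the model-based setting) and provide a unified framework for analyzing joint and local differentially private (DP) exploration. Through this framework, we prove a $\widetilde{O}(K^{3/4}/\sqrt{\epsilon})$ regret bound for $(\epsilon,\delta)$-local DP exploration and a $\widetilde{O}(\sqrt{K/\epsilon})$ regret bound for $(\epsilon,\delta)$-joint DP. We further study privacy-preserving exploration in linear MDPs (Jin et al., 2020) (a.k.a. the model-free setting), where we provide a $\widetilde{O}(\sqrt{K/\epsilon})$ regret bound for $(\epsilon,\delta)$-joint DP with a novel algorithm based on low switching. Finally, we provide insights into the issues of designing local DP algorithms in this model-free setting.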
Motivated by personalized healthcare and other applications involving sensitive data, we study online exploration in reinforcement learning with differential privacy (DP) constraints. Existing work on this problem established that no-regret learning is possible under joint differential privacy (JDP) and local differential privacy (LDP) but did not provide an algorithm with optimal regret. We close this gap for the JDP case by designing an $\epsilon$-JDP algorithm with a regret of $\widetilde{O}(\sqrt{SAH^2T}+S^2AH^3/\epsilon)$ which matches the information-theoretic lower bound of non-private learning for all choices of $\epsilon> S^{1.5}A^{0.5} H^2/\sqrt{T}$. In the above, $S$, $A$ denote the number of states and actions, $H$ denotes the planning horizon, and $T$ is the number of steps. To the best of our knowledge, this is the first private RL algorithm that achieves \emph{privacy for free} asymptotically as $T\rightarrow \infty$. Our techniques -- which could be of independent interest -- include privately releasing Bernstein-type exploration bonuses and an improved method for releasing visitation statistics. The same techniques also imply a slightly improved regret bound for the LDP case.
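The threshold on $\epsilon$ quoted above can be checked numerically: it is the value at which the privacy term $S^2AH^3/\epsilon$ equals the leading term $\sqrt{SAH^2T}$. The sketch below ignores constants and logarithmic factors, and the problem sizes in the example are arbitrary assumptions.

```python
import math


def leading_term(S, A, H, T):
    """Non-private leading term sqrt(S A H^2 T), constants and logs dropped."""
    return math.sqrt(S * A * H**2 * T)


def privacy_term(S, A, H, epsilon):
    """Additive privacy term S^2 A H^3 / epsilon."""
    return S**2 * A * H**3 / epsilon


def epsilon_threshold(S, A, H, T):
    """Value of epsilon at which the two terms coincide."""
    return S**1.5 * A**0.5 * H**2 / math.sqrt(T)


if __name__ == "__main__":
    S, A, H, T = 10, 5, 20, 10**8   # hypothetical problem sizes
    eps_star = epsilon_threshold(S, A, H, T)
    print("threshold epsilon:", eps_star)
    print("leading term:     ", leading_term(S, A, H, T))
    print("privacy term:     ", privacy_term(S, A, H, eps_star))
```

For any $\epsilon$ above this threshold the privacy term is smaller than the leading term, and since the threshold shrinks like $1/\sqrt{T}$, every fixed $\epsilon$ eventually exceeds it as $T$ grows.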
The offline reinforcement learning (RL) problem is often motivated by the need to learn data-driven decision policies in financial, legal and healthcare applications. However, the learned policy could retain sensitive information about individuals in the training data (e.g., treatment and outcome of patients), and is thus susceptible to various privacy risks. We design offline RL algorithms with differential privacy guarantees which provably prevent such risks. These algorithms also enjoy strong instance-dependent learning bounds under both tabular and linear Markov decision process (MDP) settings. Our theory and simulation suggest that the privacy guarantee comes at (almost) no drop in utility compared to the non-private counterpart for a medium-size dataset.
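Obtaining first-order regret bounds (regret bounds that scale not with the worst case but with some measure of the performance of the optimal policy on a given instance) is a core question in sequential decision-making. While such bounds exist in many settings, they have proven elusive in reinforcement learning with large state spaces. In this work we address this gap and show that it is possible to obtain regret scaling as $\mathcal{O}(\sqrt{V_1^\star K})$ in reinforcement learning with large state spaces, namely in the linear MDP setting. Here $V_1^\star$ is the value of the optimal policy and $K$ is the number of episodes. We demonstrate that existing techniques based on least-squares estimation are insufficient to obtain this result, and instead develop a novel robust self-normalized concentration bound based on the robust Catoni mean estimator, which may be of independent interest.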
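We study reinforcement learning (RL) with linear function approximation under adaptivity constraints. We consider two popular limited-adaptivity models, the batch learning model and the rare policy switch model, and propose two efficient online RL algorithms for episodic linear Markov decision processes, where the transition probability and the reward function can be represented as linear functions of some known feature mapping. Specifically, for the batch learning model, our proposed LSVI-UCB-Batch algorithm achieves an $\tilde O(\sqrt{d^3H^3T} + dHT/B)$ regret, where $d$ is the dimension of the feature mapping, $H$ is the episode length, $T$ is the number of interactions, and $B$ is the number of batches. Our result suggests that it suffices to use only $\sqrt{T/dH}$ batches to obtain $\tilde O(\sqrt{d^3H^3T})$ regret. For the rare policy switch model, our proposed LSVI-UCB-RareSwitch algorithm enjoys an $\tilde O(\sqrt{d^3H^3T[1+T/(dH)]^{dH/B}})$ regret, which implies that $dH\log T$ policy switches suffice to obtain the $\tilde O(\sqrt{d^3H^3T})$ regret. Our algorithms achieve the same regret as the LSVI-UCB algorithm (Jin et al., 2019), yet with a substantially smaller amount of adaptivity. We also establish a lower bound for the batch learning model, which suggests that the dependency on $B$ in our regret bound is tight.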
We study reinforcement learning (RL) with linear function approximation. For episodic time-inhomogeneous linear Markov decision processes (linear MDPs) whose transition dynamic can be parameterized as a linear function of a given feature mapping, we propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde O(d\sqrt{H^3K})$, where $d$ is the dimension of the feature mapping, $H$ is the planning horizon, and $K$ is the number of episodes. Our algorithm is based on a weighted linear regression scheme with a carefully designed weight, which depends on a new variance estimator that (1) directly estimates the variance of the \emph{optimal} value function, (2) monotonically decreases with respect to the number of episodes to ensure a better estimation accuracy, and (3) uses a rare-switching policy to update the value function estimator to control the complexity of the estimated value function class. Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
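We study the constrained reinforcement learning problem, in which an agent aims to maximize the expected cumulative reward subject to a constraint on the expected total value of a utility function. In contrast to existing model-based approaches or model-free methods accompanied by a "simulator", we aim to develop the first model-free, simulator-free algorithm that achieves sublinear regret and sublinear constraint violation even in large-scale systems. To this end, we consider episodic constrained Markov decision processes with linear function approximation, where the transition dynamics and the reward function can be represented as linear functions of some known feature mapping. We show that $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret and $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ constraint violation bounds can be achieved, where $d$ is the dimension of the feature mapping, $H$ is the length of the episode, and $T$ is the total number of steps. Our bounds are attained without explicitly estimating the unknown transition model or requiring a simulator, and they depend on the state space only through the dimension of the feature mapping. Hence, our bounds hold even when the number of states goes to infinity. Our main results are achieved via novel adaptations of the standard LSVI-UCB algorithm. In particular, we first introduce primal-dual optimization into the LSVI-UCB algorithm to balance regret against constraint violation. More importantly, we replace the standard greedy selection with respect to the state-action function in LSVI-UCB with a soft-max policy. This turns out to be key in establishing uniform concentration for the constrained case via its approximation-smoothness trade-off. We also show that one can achieve even zero constraint violation while still maintaining the same order with respect to $T$.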
Modern Reinforcement Learning (RL) is commonly applied to practical problems with an enormous number of states, where function approximation must be deployed to approximate either the value function or the policy. The introduction of function approximation raises a fundamental set of challenges involving computational and statistical efficiency, especially given the need to manage the exploration/exploitation tradeoff. As a result, a core RL question remains open: how can we design provably efficient RL algorithms that incorporate function approximation? This question persists even in a basic setting with linear dynamics and linear rewards, for which only linear function approximation is needed. This paper presents the first provable RL algorithm with both polynomial runtime and polynomial sample complexity in this linear setting, without requiring a "simulator" or additional assumptions. Concretely, we prove that an optimistic modification of Least-Squares Value Iteration (LSVI), a classical algorithm frequently studied in the linear setting, achieves $\tilde{O}(\sqrt{d^3H^3T})$ regret, where $d$ is the ambient dimension of feature space, $H$ is the length of each episode, and $T$ is the total number of steps. Importantly, such regret is independent of the number of states and actions.
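We study a theory of reinforcement learning (RL) in which the learner receives binary feedback only once, at the end of an episode. While this is an extreme test case for theory, it is also arguably more representative of real-world applications than the traditional requirement in RL practice that the learner receive feedback at every time step. Indeed, in many real-world applications of reinforcement learning, such as self-driving cars and robotics, it is easier to evaluate whether a learner's complete trajectory was "good" or "bad", but harder to provide a reward signal at each step. To show that learning is possible in this more challenging setting, we study the case where trajectory labels are generated by an unknown parametric model, and provide a statistically and computationally efficient algorithm that achieves sublinear regret.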
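We study model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs). In this setting, the agent works in two phases. In the exploration phase, the agent interacts with the environment and collects samples without the reward. In the planning phase, the agent is given a specific reward function and uses the samples collected during the exploration phase to learn a good policy. We propose a new provably efficient algorithm, called UCRL-RFE, under the linear mixture MDP assumption, where the transition probability kernel of the MDP can be parameterized by a linear function over certain feature mappings defined on the triplet of state, action, and next state. We show that to obtain an $\epsilon$-optimal policy for an arbitrary reward function, UCRL-RFE needs to sample at most $\tilde{\mathcal{O}}(H^5d^2\epsilon^{-2})$ episodes during the exploration phase. Here, $H$ is the length of the episode and $d$ is the dimension of the feature mapping. We also propose a variant of UCRL-RFE using a Bernstein-type bonus and show that it needs to sample at most $\tilde{\mathcal{O}}(H^4d(H + d)\epsilon^{-2})$ episodes to achieve an $\epsilon$-optimal policy. By constructing a special class of linear mixture MDPs, we also prove that any reward-free algorithm needs to sample at least $\tilde\Omega(H^2d\epsilon^{-2})$ episodes to obtain an $\epsilon$-optimal policy. Our upper bound matches the lower bound in terms of the dependence on $\epsilon$, and in terms of the dependence on $d$ if $H \ge d$.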
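We study the stochastic shortest path (SSP) problem in reinforcement learning with linear function approximation, where the transition kernel is represented as a linear mixture of unknown models. We call this class of SSP problems linear mixture SSPs. We propose a novel algorithm with Hoeffding-type confidence sets for learning the linear mixture SSP, which can attain an $\tilde{\mathcal{O}}(d B_{\star}^{1.5}\sqrt{K/c_{\min}})$ regret. Here $K$ is the number of episodes, $d$ is the dimension of the feature mapping in the mixture model, $B_{\star}$ bounds the expected cumulative cost of the optimal policy, and $c_{\min}>0$ is the lower bound of the cost function. Our algorithm also applies to the case $c_{\min} = 0$, where an $\tilde{\mathcal{O}}(K^{2/3})$ regret is guaranteed. To the best of our knowledge, this is the first algorithm with a sublinear regret guarantee for learning linear mixture SSPs. Moreover, we design a refined Bernstein-type confidence set and propose an improved algorithm, which achieves an $\tilde{\mathcal{O}}(d B_{\star}\sqrt{K/c_{\min}})$ regret. To complement the regret upper bounds, we also prove a lower bound of $\Omega(dB_{\star}\sqrt{K})$. Hence, our improved algorithm matches the lower bound up to a $1/\sqrt{c_{\min}}$ factor and poly-logarithmic factors, achieving a near-optimal regret guarantee.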
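Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration, but must propose a near-optimal policy for an arbitrary reward function revealed only after exploring. In the tabular setting, it is well known that this is a more difficult problem than reward-aware (PAC) RL, where the agent has access to the reward function during exploration, with the optimal sample complexities of the two settings differing by a factor of $|\mathcal{S}|$, the size of the state space. We show that this separation does not exist in the setting of linear MDPs. We first develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP with sample complexity scaling as $\widetilde{\mathcal{O}}(d^2 H^5/\epsilon^2)$. We then show a lower bound with matching dimension dependence of $\Omega(d^2 H^2/\epsilon^2)$, which holds for the reward-aware RL setting. To our knowledge, our approach is the first computationally efficient algorithm to achieve optimal $d$ dependence in linear MDPs, even in the single-reward PAC setting. Our algorithm relies on a novel procedure which efficiently traverses a linear MDP, collecting samples in any given "feature direction", and enjoys a sample complexity scaling optimally in the (linear MDP equivalent of the) maximal state visitation probability. We show that this exploration procedure can also be applied to solve the problem of obtaining "well-conditioned" covariates in linear MDPs.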
We study time-inhomogeneous episodic reinforcement learning (RL) under general function approximation and sparse rewards. We design a new algorithm, Variance-weighted Optimistic $Q$-Learning (VO$Q$L), based on $Q$-learning and bound its regret assuming completeness and bounded Eluder dimension for the regression function class. As a special case, VO$Q$L achieves $\tilde{O}(d\sqrt{HT}+d^6H^{5})$ regret over $T$ episodes for a horizon $H$ MDP under ($d$-dimensional) linear function approximation, which is asymptotically optimal. Our algorithm incorporates weighted regression-based upper and lower bounds on the optimal value function to obtain this improved regret. The algorithm is computationally efficient given a regression oracle over the function class, making this the first computationally tractable and statistically optimal approach for linear MDPs.
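We study episodic reinforcement learning in non-stationary linear (a.k.a. low-rank) Markov decision processes (MDPs), i.e., both the reward and the transition kernel are linear with respect to a given feature map and are allowed to evolve either slowly or abruptly over time. For this problem setting, we propose Opt-WLSVI, an optimistic model-free algorithm based on weighted least-squares value iteration, which uses exponential weights to smoothly forget data that are far in the past. We show that our algorithm, when competing against the best policy at each time, achieves a regret that is upper bounded by $\widetilde{\mathcal{O}}(d^{5/4}H^2\Delta^{1/4}K^{3/4})$, where $d$ is the dimension of the feature space, $H$ is the planning horizon, $K$ is the number of episodes, and $\Delta$ is a suitable measure of the non-stationarity of the MDP. Moreover, we point out technical gaps in the study of forgetting strategies in the non-stationary linear bandit setting made by previous works, and we propose a fix to their regret analysis.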
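The idea of using exponential weights to forget old data can be illustrated with a one-dimensional weighted ridge regression. This is only a sketch of the weighting scheme, not the Opt-WLSVI algorithm itself; the discount `gamma`, the regularizer `lam`, and the function name are assumptions made for the example.

```python
def discounted_ridge_1d(xs, ys, gamma=1.0, lam=1.0):
    """1-D ridge estimate where a sample of age j is weighted by gamma**j."""
    n = len(xs)
    num = 0.0
    den = lam
    for i, (x, y) in enumerate(zip(xs, ys)):
        w = gamma ** (n - 1 - i)   # the most recent sample has weight 1
        num += w * x * y
        den += w * x * x
    return num / den


if __name__ == "__main__":
    # The true parameter switches abruptly from 1 to 3 halfway through.
    xs = [1.0] * 100
    ys = [1.0] * 50 + [3.0] * 50
    print("no forgetting:  ", discounted_ridge_1d(xs, ys, gamma=1.0))
    print("with forgetting:", discounted_ridge_1d(xs, ys, gamma=0.9))
```

With `gamma=1.0` every sample counts equally and the estimate is pulled toward the stale parameter, while `gamma=0.9` tracks the recent value more closely at the cost of using fewer effective samples.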
We consider a multi-agent episodic MDP setup where an agent (leader) takes an action at each step of the episode, followed by another agent (follower). The state evolution and rewards depend on the joint action pair of the leader and the follower. Such interactions can find applications in many domains such as smart grids, mechanism design, security, and policymaking. We are interested in how to learn policies for both players with provable performance guarantees under a bandit feedback setting. We focus on a setup where both the leader and the follower are {\em non-myopic}, i.e., they both seek to maximize their rewards over the entire episode, and consider a linear MDP, which can model the continuous state spaces that are very common in many RL applications. We propose a {\em model-free} RL algorithm and show that $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret bounds can be achieved for both the leader and the follower, where $d$ is the dimension of the feature mapping, $H$ is the length of the episode, and $T$ is the total number of steps under the bandit feedback information setup. Thus, our result holds even when the number of states becomes infinite. The algorithm relies on a {\em novel} adaptation of the LSVI-UCB algorithm. Specifically, we replace the standard greedy policy (as the best response) with the soft-max policy for both the leader and the follower. This turns out to be key in establishing a uniform concentration bound for the value functions. To the best of our knowledge, this is the first sub-linear regret bound guarantee for Markov games with non-myopic followers and function approximation.
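While much progress has been made in understanding the minimax sample complexity of reinforcement learning (RL), that is, the complexity of learning on the "worst-case" instance, such measures of complexity often do not capture the true difficulty of learning. In practice, on an "easy" instance, we might hope to achieve a complexity far better than that achievable on the worst-case instance. In this work we seek to understand the "instance-dependent" complexity of learning near-optimal policies (PAC RL) in the setting of RL with linear function approximation. We propose an algorithm, \textsc{Pedel}, which achieves an instance-dependent measure of complexity, the first of its kind in the RL with function approximation setting, thereby capturing the difficulty of learning on each particular problem instance. Through an explicit example, we show that \textsc{Pedel} yields provable gains over low-regret, minimax-optimal algorithms, and that such algorithms are unable to hit the instance-optimal rate. Our approach relies on a novel experiment-design-based procedure which focuses the exploration budget on the "directions" most relevant to learning a near-optimal policy, and may be of independent interest.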
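We study cooperative online learning in stochastic and adversarial Markov decision processes (MDPs). That is, in each episode, $m$ agents interact with an MDP simultaneously and share information in order to minimize their regret. We consider environments with two types of randomness: \emph{fresh}, where each agent's trajectory is sampled i.i.d., and \emph{non-fresh}, where the realization is shared by all agents (but each agent's trajectory is also affected by its own actions). More precisely, with non-fresh randomness the realization of every cost and transition is fixed at the start of each episode, and agents that take the same action in the same state at the same time observe the same cost and next state. We thoroughly analyze all relevant settings, highlight the challenges and differences between the models, and prove nearly matching regret lower and upper bounds. To our knowledge, we are the first to consider cooperative reinforcement learning (RL) with either non-fresh randomness or adversarial MDPs.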
Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically simpler, more flexible to use, and thus more prevalent in modern deep RL than model-based approaches. However, empirical work has suggested that model-free algorithms may require more samples to learn [7,22]. The theoretical question of "whether model-free algorithms can be made sample efficient" is one of the most fundamental questions in RL, and remains unsolved even in the basic scenario with finitely many states and actions. We prove that, in an episodic MDP setting, Q-learning with UCB exploration achieves regret $\tilde{O}(\sqrt{H^3SAT})$, where $S$ and $A$ are the numbers of states and actions, $H$ is the number of steps per episode, and $T$ is the total number of steps. This sample efficiency matches the optimal regret that can be achieved by any model-based approach, up to a single $\sqrt{H}$ factor. To the best of our knowledge, this is the first analysis in the model-free setting that establishes $\sqrt{T}$ regret without requiring access to a "simulator." * The first two authors contributed equally.
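A single step of Q-learning with a Hoeffding-style UCB bonus might be sketched as follows. The abstract above does not spell out the update, so the learning rate $(H+1)/(H+t)$ and the bonus $c\sqrt{H^3\iota/t}$ are written here as they are commonly stated for this family of algorithms and should be read as assumptions; `c` and `iota` in particular are placeholder constants.

```python
import math


def learning_rate(H, t):
    """Step size for the t-th visit to a (step, state, action) triple."""
    return (H + 1) / (H + t)


def ucb_bonus(H, t, c=1.0, iota=1.0):
    """Exploration bonus that shrinks as the visit count t grows."""
    return c * math.sqrt(H**3 * iota / t)


def q_update(q_old, t, reward, next_value, H, c=1.0, iota=1.0):
    """Blend the old estimate with an optimistic one-step target."""
    alpha = learning_rate(H, t)
    target = reward + next_value + ucb_bonus(H, t, c, iota)
    return (1 - alpha) * q_old + alpha * target


def clipped_value(q_values, H):
    """State value: best action value, capped at the horizon H."""
    return min(H, max(q_values))


if __name__ == "__main__":
    H = 5
    q = float(H)   # optimistic initialization
    for t in range(1, 6):
        q = q_update(q, t, reward=0.5, next_value=1.0, H=H, c=0.1)
        print(t, round(q, 4))
```

No transition model is stored anywhere in this update, which is what makes the approach model-free; optimism comes only from the initialization and the bonus.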