Off-policy evaluation methods are important in recommendation systems and search engines, where data collected under an existing logging policy is used to estimate the performance of a new proposed policy. A common approach to this problem is weighting, where data is weighted by a density ratio between the probability of actions given contexts in the target and logged policies. In practice, two issues often arise. First, many problems have very large action spaces and we may not observe rewards for most actions, and so in finite samples we may encounter a positivity violation. Second, many recommendation systems are not probabilistic and so having access to logging and target policy densities may not be feasible. To address these issues, we introduce the featurized embedded permutation weighting estimator. The estimator computes the density ratio in an action embedding space, which reduces the possibility of positivity violations. The density ratio is computed leveraging recent advances in normalizing flows and density ratio estimation as a classification problem, in order to obtain estimates which are feasible in practice.
Translated by Google Translate
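The idea of treating density ratio estimation as a classification problem can be illustrated with a small sketch. It is not the estimator from the abstract above: it assumes a discrete embedding space and uses empirical frequencies in place of a trained classifier, and the function name `ratio_from_classifier` and the toy samples are hypothetical. It relies on the standard identity that the density ratio equals the classifier odds times the ratio of sample sizes.

```python
from collections import Counter


def ratio_from_classifier(target_samples, logged_samples):
    """Return a function e -> estimated p_target(e) / p_logged(e).

    Hypothetical helper: the "classifier" here is the empirical frequency
    P(label = target | e), which is only sensible for discrete embeddings.
    """
    n_target, n_logged = len(target_samples), len(logged_samples)
    target_counts = Counter(target_samples)
    logged_counts = Counter(logged_samples)

    def ratio(e):
        if logged_counts[e] == 0:
            # Positivity violation: e never appears in the logged data.
            raise ValueError("no logged samples for embedding %r" % (e,))
        # Probability that a sample with embedding e came from the target set.
        p_target = target_counts[e] / (target_counts[e] + logged_counts[e])
        # Odds of the classifier, corrected for unequal sample sizes.
        return (p_target / (1.0 - p_target)) * (n_logged / n_target)

    return ratio


# Toy data (assumed for illustration only).
ratio = ratio_from_classifier(["a", "a", "a", "b"], ["a", "b", "b", "b"])
print(ratio("a"), ratio("b"))
```

In this toy case the classifier-based ratio coincides with the ratio of empirical frequencies; with continuous embeddings a learned probabilistic classifier would take the place of the counts.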
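Off-policy evaluation (OPE) in contextual bandits has seen rapid adoption in real-world systems, since it enables offline evaluation of new policies using only historical log data. Unfortunately, when the number of actions is large, existing OPE estimators (most of which are based on inverse propensity score weighting) degrade severely and can suffer from extreme bias and variance. This foils the use of OPE in many applications, from recommender systems to language models. To overcome this issue, we propose a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space. We characterize the bias, variance, and mean squared error of the proposed estimator and analyze the conditions under which action embeddings provide statistical benefits over conventional estimators. In addition to the theoretical analysis, we find that the empirical performance improvement can be substantial, enabling reliable OPE even when existing estimators collapse due to a large number of actions.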
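For readers unfamiliar with the baseline that this abstract criticizes, the following is a minimal sketch of plain inverse propensity score (IPS) weighting. It is not the marginalized estimator proposed in the paper; the function name `ips_estimate`, the log record layout, and the toy data are assumptions made for illustration.

```python
def ips_estimate(logs, target_policy):
    """Plain IPS estimate of a target policy's value.

    logs: list of (context, action, reward, logging_prob) tuples, where
          logging_prob is the logging policy's probability of the action.
    target_policy: function (context, action) -> probability under the
          target policy.
    """
    if not logs:
        raise ValueError("logs must not be empty")
    total = 0.0
    for context, action, reward, logging_prob in logs:
        if logging_prob <= 0.0:
            raise ValueError("logged actions need positive logging probability")
        # Importance weight: target probability over logging probability.
        weight = target_policy(context, action) / logging_prob
        total += weight * reward
    return total / len(logs)


# Toy example: two actions logged uniformly at random; action 1 always
# yields reward 1 and action 0 yields reward 0.
toy_logs = [("x", 0, 0.0, 0.5), ("x", 1, 1.0, 0.5),
            ("x", 0, 0.0, 0.5), ("x", 1, 1.0, 0.5)]


def always_one(context, action):
    # Deterministic target policy that always plays action 1.
    return 1.0 if action == 1 else 0.0


print(ips_estimate(toy_logs, always_one))
```

Under uniform logging over K actions, the weight for a deterministic target policy is K on matching records and 0 elsewhere, which is one way to see why the variance grows with the size of the action space.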
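We study a pricing setting where each customer is offered a contextualized price based on customer and/or product features that are predictive of the customer's valuation for that product. Often only historical sales records are available, where we observe whether each customer purchased a product at the prescribed price rather than the customer's true valuation. As such, the data is influenced by the historical sales policy, which introduces difficulties in (a) estimating future loss/regret for pricing policies without the possibility of conducting real experiments and (b) optimizing new policies for downstream tasks such as revenue management. We study how to formulate loss functions that can be used to optimize pricing policies directly, rather than going through an intermediate demand estimation stage, which can be biased in practice due to model misspecification, regularization, or poor calibration. While existing approaches have been proposed for when valuation data is available, we propose loss functions for the observational data setting. To achieve this, we adapt ideas from machine learning with corrupted labels, where we can consider each observed customer's outcome (purchased or not at the prescribed price) as a (known) probabilistic transformation of the customer's valuation. From this transformation we derive a class of suitable unbiased loss functions. Within this class we identify minimum-variance estimators and those that are robust to poor demand function estimation, and we provide guidance on when the estimated demand function is useful. Furthermore, we show that, when applied to our contextual pricing setting, estimators popular in the off-policy evaluation literature fall within this class of loss functions, and we offer managerial insights on when each estimator is likely to perform well in practice.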
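We consider the off-policy evaluation (OPE) problem in contextual bandits, where the goal is to estimate the value of a target policy using data collected by a logging policy. The most popular approaches to OPE are variants of the doubly robust (DR) estimator, obtained by combining a direct method (DM) estimator with a correction term involving the inverse propensity score (IPS). Existing algorithms primarily focus on strategies to reduce the variance of the DR estimator arising from large IPS. We propose a new approach called the Doubly Robust with Information borrowing and Context-based switching (DR-IC) estimator, which focuses on reducing both bias and variance. The DR-IC estimator replaces the standard DM estimator with a parametric reward model that borrows information from "closer" contexts through a correlation structure that depends on the IPS. The DR-IC estimator also adaptively interpolates between this modified DM estimator and a modified DR estimator based on a context-specific switching rule. We give provable guarantees on the performance of the DR-IC estimator. We also demonstrate the superior performance of the DR-IC estimator compared to state-of-the-art OPE algorithms on a number of benchmark problems.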
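The combination of a DM estimate with an IPS correction term that this abstract describes can be sketched as follows. This is the textbook doubly robust estimator, not DR-IC: it includes neither information borrowing nor context-based switching. The function name `dr_estimate`, the record layout, and the toy data are assumptions for illustration.

```python
def dr_estimate(logs, actions, target_policy, reward_model):
    """Standard doubly robust (DR) value estimate.

    logs: list of (context, action, reward, logging_prob) tuples.
    actions: list of all available actions.
    target_policy: function (context, action) -> target probability.
    reward_model: function (context, action) -> predicted reward.
    """
    if not logs:
        raise ValueError("logs must not be empty")
    total = 0.0
    for context, action, reward, logging_prob in logs:
        # Direct method term: model-predicted value of the target policy.
        dm_term = sum(target_policy(context, a) * reward_model(context, a)
                      for a in actions)
        # IPS correction applied to the model's residual on the logged action.
        weight = target_policy(context, action) / logging_prob
        total += dm_term + weight * (reward - reward_model(context, action))
    return total / len(logs)


# Toy data: uniform logging over two actions; action 1 pays 1, action 0 pays 0.
toy_logs = [("x", 0, 0.0, 0.5), ("x", 1, 1.0, 0.5),
            ("x", 0, 0.0, 0.5), ("x", 1, 1.0, 0.5)]


def always_one(context, action):
    # Deterministic target policy that always plays action 1.
    return 1.0 if action == 1 else 0.0


def crude_model(context, action):
    # Deliberately misspecified reward model: predicts 0.5 everywhere.
    return 0.5


print(dr_estimate(toy_logs, [0, 1], always_one, crude_model))
```

In this toy case the logging probabilities are correct, so the correction term offsets the error of the misspecified reward model; this is the sense in which the estimator is "doubly robust".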
Counterfactual reasoning from logged data has become increasingly important for many applications such as web advertising or healthcare. In this paper, we address the problem of learning stochastic policies with continuous actions from the viewpoint of counterfactual risk minimization (CRM). While the CRM framework is appealing and well studied for discrete actions, the continuous action case raises new challenges in modeling, optimization, and offline model selection with real data, the last of which turns out to be particularly challenging. Our paper contributes to these three aspects of the CRM estimation pipeline. First, we introduce a modeling strategy based on a joint kernel embedding of contexts and actions, which overcomes the shortcomings of previous discretization approaches. Second, we empirically show that the optimization aspect of counterfactual learning is important, and we demonstrate the benefits of proximal point algorithms and differentiable estimators. Finally, we propose an evaluation protocol for offline policies in real-world logged systems, which is challenging since policies cannot be replayed on test data, and we release a new large-scale dataset along with multiple synthetic, yet realistic, evaluation setups.
Off-policy evaluation (OPE) attempts to predict the performance of counterfactual policies using log data from a different policy. We extend its applicability by developing an OPE method for a class of both full support and deficient support logging policies in contextual-bandit settings. This class includes deterministic bandit (such as Upper Confidence Bound) as well as deterministic decision-making based on supervised and unsupervised learning. We prove that our method's prediction converges in probability to the true performance of a counterfactual policy as the sample size increases. We validate our method with experiments on partly and entirely deterministic logging policies. Finally, we apply it to evaluate coupon targeting policies by a major online platform and show how to improve the existing policy.
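This paper is concerned with constructing, offline, a confidence interval for a target policy's value based on pre-collected observational data in infinite-horizon settings. Most existing works assume that no unmeasured variables confound the observed actions. This assumption, however, is likely to be violated in real applications such as healthcare and the technology industry. In this paper, we show that, with some auxiliary variables that mediate the effect of actions on the system dynamics, the target policy's value is identifiable in a confounded Markov decision process. Based on this result, we develop an efficient off-policy value estimator that is robust to potential model misspecification and provides rigorous uncertainty quantification. Our method is justified by theoretical results and by simulated and real datasets obtained from ridesharing companies. A Python implementation of the proposed procedure is available at https://github.com/mamba413/cope.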
Off-Policy evaluation (OPE) is concerned with evaluating a new target policy using offline data generated by a potentially different behavior policy. It is critical in a number of sequential decision making problems ranging from healthcare to technology industries. Most of the work in existing literature is focused on evaluating the mean outcome of a given policy, and ignores the variability of the outcome. However, in a variety of applications, criteria other than the mean may be more sensible. For example, when the reward distribution is skewed and asymmetric, quantile-based metrics are often preferred for their robustness. In this paper, we propose a doubly-robust inference procedure for quantile OPE in sequential decision making and study its asymptotic properties. In particular, we propose utilizing state-of-the-art deep conditional generative learning methods to handle parameter-dependent nuisance function estimation. We demonstrate the advantages of this proposed estimator through both simulations and a real-world dataset from a short-video platform. In particular, we find that our proposed estimator outperforms classical OPE estimators for the mean in settings with heavy-tailed reward distributions.
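Off-policy evaluation and learning (OPE/L) use offline observational data to make better decisions, which is crucial in applications where online experimentation is limited. However, because it depends entirely on logged data, OPE/L is sensitive to environment distribution shifts, that is, discrepancies between the data-generating environment and the environment in which policies are deployed. \citet{si2020distributional} proposed distributionally robust OPE/L (DROPE/L) to address this, but the proposal relies on inverse propensity weighting, whose estimation error and regret deteriorate if the propensities are nonparametrically estimated and whose variance is suboptimal even if they are not. For standard, non-robust OPE/L, this is solved by doubly robust (DR) methods, but they do not naturally extend to the more complex DROPE/L, which involves a worst-case expectation. In this paper, we propose the first DR algorithms for DROPE/L with KL-divergence uncertainty sets. For evaluation, we propose Localized Doubly Robust DROPE (LDR$^2$OPE) and show that it achieves semiparametric efficiency under weak product rate conditions. Thanks to a localization technique, LDR$^2$OPE only requires fitting a small number of regressions, just like DR methods for standard OPE. For learning, we propose Continuum Doubly Robust DROPL (CDR$^2$OPL) and show that, under a product rate condition involving a continuum of regressions, it enjoys a fast regret rate of $\mathcal{O}\left(n^{-1/2}\right)$ even when unknown propensities are nonparametrically estimated. We empirically validate our algorithms in simulations and further extend our results to general $f$-divergence uncertainty sets.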
We consider local kernel metric learning for off-policy evaluation (OPE) of deterministic policies in contextual bandits with continuous action spaces. Our work is motivated by practical scenarios where the target policy needs to be deterministic due to domain requirements, such as prescription of treatment dosage and duration in medicine. Although importance sampling (IS) provides a basic principle for OPE, it is ill-posed for a deterministic target policy with continuous actions. Our main idea is to relax the target policy and pose the problem as kernel-based estimation, where we learn the kernel metric in order to minimize the overall mean squared error (MSE). We present an analytic solution for the optimal metric, based on an analysis of bias and variance. Whereas prior work has been limited to scalar action spaces or kernel bandwidth selection, our work goes a step further by handling vector action spaces and metric optimization. Through experiments on various domains, we show that our estimator is consistent and significantly reduces the MSE compared to baseline OPE methods.
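We study the estimation of causal parameters when not all confounders are observed and negative controls are available instead. Recent work has shown how these can enable identification and efficient estimation via two so-called bridge functions. In this paper, we tackle the primary challenge to causal inference using negative controls: the identification and estimation of these bridge functions. Previous work has relied on completeness conditions on these functions to identify the causal parameters, required uniqueness assumptions in estimation, and focused on parametric estimation of the bridge functions. Instead, we provide a new identification strategy that avoids the completeness condition. We also provide new estimators for these functions based on minimax learning formulations. These estimators accommodate general function classes such as reproducing kernel Hilbert spaces and neural networks. We study finite-sample convergence results both for estimating the bridge functions themselves and for the final estimation of the causal parameter under a variety of combinations of assumptions. We avoid uniqueness conditions on the bridge functions as much as possible.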
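Bridge sampling is a powerful Monte Carlo method for estimating ratios of normalizing constants. Various methods have been introduced to improve its efficiency. These methods aim to increase the overlap between the densities by applying appropriate transformations to them without changing their normalizing constants. In this paper, we first give a new estimator of the asymptotic relative mean squared error (RMSE) of the optimal bridge estimator by equivalently estimating an $f$-divergence between the two densities. We then use this framework to propose the $f$-GAN-Bridge estimator ($f$-GB), based on a bijective transformation that maps one density to the other and minimizes the asymptotic RMSE of the optimal bridge estimator with respect to the densities. This transformation is chosen by minimizing a specific $f$-divergence between the densities using an $f$-GAN. We show that $f$-GB is optimal in the sense that, within any given set of candidate transformations, the $f$-GB estimator can asymptotically achieve an RMSE lower than or equal to that achieved by bridge estimators based on any other transformed densities. Numerical experiments show that $f$-GB outperforms existing methods in simulated and real-world examples. In addition, we discuss how bridge estimators arise naturally from the problem of $f$-divergence estimation.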
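The problem of estimating a linear functional based on observational data is canonical in both the causal inference and bandit literatures. We analyze a broad class of two-stage procedures that first estimate the treatment effect function and then use this quantity to estimate the linear functional. We prove non-asymptotic upper bounds on the mean squared error of such procedures: these bounds reveal that, in order to obtain non-asymptotically optimal procedures, the error in estimating the treatment effect should be minimized in a particular weighted $L^2$-norm. We analyze a two-stage procedure based on constrained regression in this weighted norm, and establish its instance-dependent optimality in finite samples via matching non-asymptotic local minimax lower bounds. These results show that the optimal non-asymptotic risk, in addition to depending on the asymptotically efficient variance, depends on the weighted norm distance between the true outcome function and its approximation by the richest function class supported by the sample size.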
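Triangular flows, also known as Knöthe-Rosenblatt measure couplings, comprise an important building block of normalizing flow models for generative modeling and density estimation, including popular autoregressive flow models such as real-valued non-volume preserving transformation models (Real NVP). We present statistical guarantees and sample complexity bounds for triangular flow statistical models. In particular, we establish the statistical consistency and the finite-sample convergence rates of the Kullback-Leibler estimator of the Knöthe-Rosenblatt measure coupling using tools from empirical process theory. Our results highlight the anisotropic geometry of the function classes at play in triangular flows, shed light on optimal coordinate ordering, and lead to statistical guarantees for Jacobian flows. We conduct numerical experiments on synthetic data to illustrate the practical implications of our theoretical findings.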
This paper studies offline policy learning, which aims at utilizing observations collected a priori (from either fixed or adaptively evolving behavior policies) to learn an optimal individualized decision rule that achieves the best overall outcomes for a given population. Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics are lower bounded in the offline dataset; put differently, the performance of the existing methods depends on the worst-case propensity in the offline dataset. As one has no control over the data collection process, this assumption can be unrealistic in many situations, especially when the behavior policies are allowed to evolve over time with diminishing propensities for certain actions. In this paper, we propose a new algorithm that optimizes lower confidence bounds (LCBs) -- instead of point estimates -- of the policy values. The LCBs are constructed using knowledge of the behavior policies for collecting the offline data. Without assuming any uniform overlap condition, we establish a data-dependent upper bound for the suboptimality of our algorithm, which only depends on (i) the overlap for the optimal policy, and (ii) the complexity of the policy class we optimize over. As an implication, for adaptively collected data, we ensure efficient policy learning as long as the propensities for optimal actions are lower bounded over time, while those for suboptimal ones are allowed to diminish arbitrarily fast. In our theoretical analysis, we develop a new self-normalized type concentration inequality for inverse-propensity-weighting estimators, generalizing the well-known empirical Bernstein's inequality to unbounded and non-i.i.d. data.
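We explore a new model of bandit experiments in which a potentially nonstationary sequence of contexts influences the arms' performance. Context-unaware algorithms risk confounding, while those that perform correct inference face information delays. Our main insight is that an algorithm we call deconfounded Thompson sampling strikes a delicate balance between adaptivity and robustness. Its adaptivity leads to optimal efficiency in easy stationary instances, yet it displays surprising resilience in hard nonstationary ones that cause other adaptive algorithms to fail.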
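Counterfactual risk minimization is a framework for offline policy optimization with logged data consisting of the context, action, propensity score, and reward for each sample point. In this work, we build on this framework and propose a learning method for settings where the rewards for some samples are not observed, so that the logged data consists of a subset of samples with unknown rewards and a subset of samples with known rewards. This setting arises in many application domains, including advertising and healthcare. Although reward feedback is missing for some samples, the unknown-reward samples can still be leveraged to minimize the risk, and we refer to this setting as semi-counterfactual risk minimization. To approach this kind of learning problem, we derive new upper bounds on the true risk under the inverse propensity score estimator. Building on these bounds, we then propose a regularized counterfactual risk minimization method in which the regularization term is based only on the logged unknown-reward dataset and is therefore reward-independent. We also propose another algorithm based on generating pseudo-rewards for the logged unknown-reward dataset. Experimental results with neural networks and benchmark datasets indicate that these algorithms can leverage the logged unknown-reward dataset in addition to the logged known-reward dataset.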
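Addressing such diverse ends as safety, alignment with human preferences, and learning efficiency, a growing line of reinforcement learning research focuses on risk functionals that depend on the entire distribution of returns. Recent work on off-policy risk assessment (OPRA) for contextual bandits introduced consistent estimators for the target policy's distribution of returns, along with finite-sample guarantees that hold simultaneously over all risks. In this paper, we lift OPRA to Markov decision processes (MDPs), where importance sampling (IS) CDF estimators suffer from high variance on longer trajectories due to small effective sample sizes. To mitigate these problems, we incorporate model-based estimation to develop the first doubly robust (DR) estimator for the CDF of returns in MDPs. This estimator has significantly lower variance and, when the model is well specified, achieves the Cramer-Rao variance lower bound. Moreover, for many risk functionals, the downstream estimates enjoy both lower bias and lower variance. Additionally, we derive the first minimax lower bounds for off-policy CDF and risk estimation, which match our error bounds up to a constant factor. Finally, we demonstrate the precision of our DR CDF estimates experimentally on several different environments.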
Reinforcement learning (RL) is one of the most vibrant research frontiers in machine learning and has been recently applied to solve a number of challenging problems. In this paper, we primarily focus on off-policy evaluation (OPE), one of the most fundamental topics in RL. In recent years, a number of OPE methods have been developed in the statistics and computer science literature. We provide a discussion on the efficiency bound of OPE, some of the existing state-of-the-art OPE methods, their statistical properties and some other related research directions that are currently actively explored.
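We study the problem of model selection in batch policy optimization: given a fixed, partial-feedback dataset and $M$ model classes, learn a policy whose performance is competitive with the policy derived from the best model class. We formalize the problem in the contextual bandit setting with linear model classes by identifying three sources of error that any model selection algorithm should optimally trade off in order to be competitive: (1) approximation error, (2) statistical complexity, and (3) coverage. The first two sources are common in model selection for supervised learning, where optimally trading off these properties is well studied. In contrast, the third source is unique to batch policy optimization and is due to the dataset shift inherent to the setting. We first show that no batch policy optimization algorithm can achieve a guarantee addressing all three simultaneously, revealing a stark contrast between the difficulties of batch policy optimization and the positive results available in supervised learning. Despite this negative result, we show that relaxing any one of the three error sources enables the design of algorithms achieving near-oracle inequalities for the remaining two. We conclude with experiments demonstrating the efficacy of these algorithms.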