智能论文笔记

Counterfactual inference for sequential experiments

Raaz Dwivedi , Katherine Tian , Sabina Tomkins , Predrag Klasnja , Susan Murphy , Devavrat Shah

分类： (统计)机器学习 | 机器学习

2022-02-14

We consider after-study statistical inference for sequentially designed experiments wherein multiple units are assigned treatments for multiple time points using treatment policies that adapt over time. Our goal is to provide inference guarantees for the counterfactual mean at the smallest possible scale -- mean outcome under different treatments for each unit and each time -- with minimal assumptions on the adaptive treatment policy. Without any structural assumptions on the counterfactual means, this challenging task is infeasible due to more unknowns than observed data points. To make progress, we introduce a latent factor model over the counterfactual means that serves as a non-parametric generalization of the non-linear mixed effects model and the bilinear latent factor model considered in prior works. For estimation, we use a non-parametric method, namely a variant of nearest neighbors, and establish a non-asymptotic high probability error bound for the counterfactual mean for each unit and each time. Under regularity conditions, this bound leads to asymptotically valid confidence intervals for the counterfactual mean as the number of units and time points grows to $\infty$.

translated by 谷歌翻译

Doubly robust nearest neighbors in factor models

Raaz Dwivedi , Katherine Tian , Sabina Tomkins , Predrag Klasnja , Susan Murphy , Devavrat Shah

分类： (统计)机器学习 | 机器学习

2022-11-25

In this technical note, we introduce an improved variant of nearest neighbors for counterfactual inference in panel data settings where multiple units are assigned multiple treatments over multiple time points, each sampled with constant probabilities. We call this estimator a doubly robust nearest neighbor estimator and provide a high probability non-asymptotic error bound for the mean parameter corresponding to each unit at each time. Our guarantee shows that the doubly robust estimator provides a (near-)quadratic improvement in the error compared to nearest neighbor estimators analyzed in prior work for these settings.

translated by 谷歌翻译

Off-policy estimation of linear functionals: Non-asymptotic theory for semi-parametric efficiency

Wenlong Mou , Martin J. Wainwright , Peter L. Bartlett

分类： (统计)机器学习

2022-09-26

在因果推理和强盗文献中，基于观察数据的线性功能估算线性功能的问题是规范的。我们分析了首先估计治疗效果函数的广泛的两阶段程序，然后使用该数量来估计线性功能。我们证明了此类过程的均方误差上的非反应性上限：这些边界表明，为了获得非反应性最佳程序，应在特定加权$ l^2 $中最大程度地估算治疗效果的误差。 -规范。我们根据该加权规范的约束回归分析了两阶段的程序，并通过匹配非轴突局部局部最小值下限，在有限样品中建立了实例依赖性最优性。这些结果表明，除了取决于渐近效率方差之外，最佳的非质子风险除了取决于样本量支持的最富有函数类别的真实结果函数与其近似类别之间的加权规范距离。

translated by 谷歌翻译

Optimal and instance-dependent guarantees for Markovian linear stochastic approximation

Wenlong Mou , Ashwin Pananjady , Martin J. Wainwright , Peter L. Bartlett

分类：机器学习 | (统计)机器学习

2021-12-23

我们研究了随机近似程序，以便基于观察来自ergodic Markov链的长度$ n $的轨迹来求近求解$ d -dimension的线性固定点方程。我们首先表现出$ t _ {\ mathrm {mix}} \ tfrac {n}} \ tfrac {n}} \ tfrac {d}} \ tfrac {d} {n} $的非渐近性界限。$ t _ {\ mathrm {mix $是混合时间。然后，我们证明了一种在适当平均迭代序列上的非渐近实例依赖性，具有匹配局部渐近最小的限制的领先术语，包括对参数$的敏锐依赖（d，t _ {\ mathrm {mix}}） $以高阶术语。我们将这些上限与非渐近Minimax的下限补充，该下限是建立平均SA估计器的实例 - 最优性。我们通过Markov噪声的政策评估导出了这些结果的推导 - 覆盖了所有$ \ lambda \中的TD（$ \ lambda $）算法，以便[0,1）$ - 和线性自回归模型。我们的实例依赖性表征为HyperParameter调整的细粒度模型选择程序的设计开放了门（例如，在运行TD（$ \ Lambda $）算法时选择$ \ lambda $的值）。

translated by 谷歌翻译

Adaptivity and Confounding in Multi-Armed Bandit Experiments

Chao Qin , Daniel Russo

分类：机器学习 | (统计)机器学习

2022-02-18

我们探索了一个新的强盗实验模型，其中潜在的非组织序列会影响武器的性能。上下文 - 统一算法可能会混淆，而那些执行正确的推理面部信息延迟的算法。我们的主要见解是，我们称之为Deconfounst Thompson采样的算法在适应性和健壮性之间取得了微妙的平衡。它的适应性在易于固定实例中带来了最佳效率，但是在硬性非平稳性方面显示出令人惊讶的弹性，这会导致其他自适应算法失败。

translated by 谷歌翻译

Policy learning "without'' overlap: Pessimism and generalized empirical Bernstein's inequality

Ying Jin , Zhimei Ren , Zhuoran Yang , Zhaoran Wang

分类：机器学习 | (统计)机器学习

2022-12-19

This paper studies offline policy learning, which aims at utilizing observations collected a priori (from either fixed or adaptively evolving behavior policies) to learn an optimal individualized decision rule that achieves the best overall outcomes for a given population. Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics are lower bounded in the offline dataset; put differently, the performance of the existing methods depends on the worst-case propensity in the offline dataset. As one has no control over the data collection process, this assumption can be unrealistic in many situations, especially when the behavior policies are allowed to evolve over time with diminishing propensities for certain actions. In this paper, we propose a new algorithm that optimizes lower confidence bounds (LCBs) -- instead of point estimates -- of the policy values. The LCBs are constructed using knowledge of the behavior policies for collecting the offline data. Without assuming any uniform overlap condition, we establish a data-dependent upper bound for the suboptimality of our algorithm, which only depends on (i) the overlap for the optimal policy, and (ii) the complexity of the policy class we optimize over. As an implication, for adaptively collected data, we ensure efficient policy learning as long as the propensities for optimal actions are lower bounded over time, while those for suboptimal ones are allowed to diminish arbitrarily fast. In our theoretical analysis, we develop a new self-normalized type concentration inequality for inverse-propensity-weighting estimators, generalizing the well-known empirical Bernstein's inequality to unbounded and non-i.i.d. data.

translated by 谷歌翻译

Policy design in experiments with unknown interference

Davide Viviano

分类：机器学习

2020-11-16

本文提出了一种估计溢出效应存在福利最大化政策的实验设计。我考虑一个设置在其中组织成一个有限数量的大型群集，并在每个群集中以不观察到的方式交互。作为第一种贡献，我介绍了一个单波实验，以估计治疗概率的变化的边际效应，以考虑到溢出率，并测试政策最优性。该设计在群集中独立地随机化处理，并诱导局部扰动到对簇成对的治疗概率。使用估计的边际效应，我构建了对定期治疗分配规则最大化福利的实际测试，并且我表征了其渐近性质。该想法是，研究人员应报告对福利最大化政策的边际效应和测试的估计：边际效应表明福利改善的方向，并提供了关于是否值得进行额外实验以估计估计福利改善的证据治疗分配。作为第二种贡献，我设计了多波实验来估计治疗分配规则并最大化福利。我获得了小型样本保证，最大可获得的福利和估计政策（遗憾）评估的福利之间的差异。这种保证的必要性是，遗憾在迭代和集群的数量中线性会聚到零。校准在信息扩散和现金转移方案上校准的模拟表明，该方法导致了显着的福利改进。

translated by 谷歌翻译

A Non-Asymptotic Framework for Approximate Message Passing in Spiked Models

Gen Li , Yuting Wei

分类：机器学习 | (统计)机器学习

2022-08-05

近似消息传递（AMP）是解决高维统计问题的有效迭代范式。但是，当迭代次数超过$ o \ big（\ frac {\ log n} {\ log log \ log \ log n} \时big）$（带有$ n $问题维度）。为了解决这一不足，本文开发了一个非吸附框架，用于理解峰值矩阵估计中的AMP。基于AMP更新的新分解和可控的残差项，我们布置了一个分析配方，以表征在存在独立初始化的情况下AMP的有限样本行为，该过程被进一步概括以进行光谱初始化。作为提出的分析配方的两个具体后果：（i）求解$ \ mathbb {z} _2 $同步时，我们预测了频谱初始化AMP的行为，最高为$ o \ big（\ frac {n} {\ mathrm {\ mathrm { poly} \ log n} \ big）$迭代，表明该算法成功而无需随后的细化阶段（如最近由\ citet {celentano2021local}推测）; （ii）我们表征了稀疏PCA中AMP的非反应性行为（在尖刺的Wigner模型中），以广泛的信噪比。

translated by 谷歌翻译

Optimal variance-reduced stochastic approximation in Banach spaces

Wenlong Mou , Koulik Khamaru , Martin J. Wainwright , Peter L. Bartlett , Michael I. Jordan

分类：机器学习 | (统计)机器学习

2022-01-21

We study the problem of estimating the fixed point of a contractive operator defined on a separable Banach space. Focusing on a stochastic query model that provides noisy evaluations of the operator, we analyze a variance-reduced stochastic approximation scheme, and establish non-asymptotic bounds for both the operator defect and the estimation error, measured in an arbitrary semi-norm. In contrast to worst-case guarantees, our bounds are instance-dependent, and achieve the local asymptotic minimax risk non-asymptotically. For linear operators, contractivity can be relaxed to multi-step contractivity, so that the theory can be applied to problems like average reward policy evaluation problem in reinforcement learning. We illustrate the theory via applications to stochastic shortest path problems, two-player zero-sum Markov games, as well as policy evaluation and $Q$-learning for tabular Markov decision processes.

translated by 谷歌翻译

Estimation and Inference on Heterogeneous Treatment Effects in High-Dimensional Dynamic Panels under Weak Dependence

Vira Semenova , Matt Goldman , Victor Chernozhukov , Matt Taddy

分类： (统计)机器学习

2017-12-28

This paper provides estimation and inference methods for a conditional average treatment effects (CATE) characterized by a high-dimensional parameter in both homogeneous cross-sectional and unit-heterogeneous dynamic panel data settings. In our leading example, we model CATE by interacting the base treatment variable with explanatory variables. The first step of our procedure is orthogonalization, where we partial out the controls and unit effects from the outcome and the base treatment and take the cross-fitted residuals. This step uses a novel generic cross-fitting method we design for weakly dependent time series and panel data. This method "leaves out the neighbors" when fitting nuisance components, and we theoretically power it by using Strassen's coupling. As a result, we can rely on any modern machine learning method in the first step, provided it learns the residuals well enough. Second, we construct an orthogonal (or residual) learner of CATE -- the Lasso CATE -- that regresses the outcome residual on the vector of interactions of the residualized treatment with explanatory variables. If the complexity of CATE function is simpler than that of the first-stage regression, the orthogonal learner converges faster than the single-stage regression-based learner. Third, we perform simultaneous inference on parameters of the CATE function using debiasing. We also can use ordinary least squares in the last two steps when CATE is low-dimensional. In heterogeneous panel data settings, we model the unobserved unit heterogeneity as a weakly sparse deviation from Mundlak (1978)'s model of correlated unit effects as a linear function of time-invariant covariates and make use of L1-penalization to estimate these models. We demonstrate our methods by estimating price elasticities of groceries based on scanner data. We note that our results are new even for the cross-sectional (i.i.d) case.

translated by 谷歌翻译

Estimation and Inference of Heterogeneous Treatment Effects using Random Forests

Stefan Wager , Susan Athey

分类：

2015-10-14

Many scientific and engineering challenges-ranging from personalized medicine to customized marketing recommendations-require an understanding of treatment effect heterogeneity. In this paper, we develop a non-parametric causal forest for estimating heterogeneous treatment effects that extends Breiman's widely used random forest algorithm. In the potential outcomes framework with unconfoundedness, we show that causal forests are pointwise consistent for the true treatment effect, and have an asymptotically Gaussian and centered sampling distribution. We also discuss a practical method for constructing asymptotic confidence intervals for the true treatment effect that are centered at the causal forest estimates. Our theoretical results rely on a generic Gaussian theory for a large family of random forest algorithms. To our knowledge, this is the first set of results that allows any type of random forest, including classification and regression forests, to be used for provably valid statistical inference. In experiments, we find causal forests to be substantially more powerful than classical methods based on nearest-neighbor matching, especially in the presence of irrelevant covariates.

translated by 谷歌翻译

The Projected Covariance Measure for assumption-lean variable significance testing

Anton Rask Lundborg , Ilmun Kim , Rajen D. Shah , Richard J. Samworth

分类： (统计)机器学习

2022-11-03

Testing the significance of a variable or group of variables $X$ for predicting a response $Y$, given additional covariates $Z$, is a ubiquitous task in statistics. A simple but common approach is to specify a linear model, and then test whether the regression coefficient for $X$ is non-zero. However, when the model is misspecified, the test may have poor power, for example when $X$ is involved in complex interactions, or lead to many false rejections. In this work we study the problem of testing the model-free null of conditional mean independence, i.e. that the conditional mean of $Y$ given $X$ and $Z$ does not depend on $X$. We propose a simple and general framework that can leverage flexible nonparametric or machine learning methods, such as additive models or random forests, to yield both robust error control and high power. The procedure involves using these methods to perform regressions, first to estimate a form of projection of $Y$ on $X$ and $Z$ using one half of the data, and then to estimate the expected conditional covariance between this projection and $Y$ on the remaining half of the data. While the approach is general, we show that a version of our procedure using spline regression achieves what we show is the minimax optimal rate in this nonparametric testing problem. Numerical experiments demonstrate the effectiveness of our approach both in terms of maintaining Type I error control, and power, compared to several existing approaches.

translated by 谷歌翻译

Perturbation Analysis of Randomized SVD and its Applications to High-dimensional Statistics

Yichi Zhang , Minh Tang

分类： (统计)机器学习

2022-03-19

随机奇异值分解（RSVD）是用于计算大型数据矩阵截断的SVD的一类计算算法。给定A $ n \ times n $对称矩阵$ \ mathbf {m} $，原型RSVD算法输出通过计算$ \ mathbf {m mathbf {m} $的$ k $引导singular vectors的近似m}^{g} \ mathbf {g} $;这里$ g \ geq 1 $是一个整数，$ \ mathbf {g} \ in \ mathbb {r}^{n \ times k} $是一个随机的高斯素描矩阵。在本文中，我们研究了一般的“信号加上噪声”框架下的RSVD的统计特性，即，观察到的矩阵$ \ hat {\ mathbf {m}} $被认为是某种真实但未知的加法扰动信号矩阵$ \ mathbf {m} $。我们首先得出$ \ ell_2 $（频谱规范）和$ \ ell_ {2 \ to \ infty} $（最大行行列$ \ ell_2 $ norm）$ \ hat {\ hat {\ Mathbf {M}} $和信号矩阵$ \ Mathbf {M} $的真实单数向量。这些上限取决于信噪比（SNR）和功率迭代$ g $的数量。观察到一个相变现象，其中较小的SNR需要较大的$ g $值以保证$ \ ell_2 $和$ \ ell_ {2 \ to \ fo \ infty} $ distances的收敛。我们还表明，每当噪声矩阵满足一定的痕量生长条件时，这些相变发生的$ g $的阈值都会很清晰。最后，我们得出了近似奇异向量的行波和近似矩阵的进入波动的正常近似。我们通过将RSVD的几乎最佳性能保证在应用于三个统计推断问题的情况下，即社区检测，矩阵完成和主要的组件分析，并使用缺失的数据来说明我们的理论结果。

translated by 谷歌翻译

Sequential estimation of quantiles with applications to A/B-testing and best-arm identification

Steven R. Howard , Aaditya Ramdas

分类： (统计)机器学习

2019-06-24

我们提出了置信度序列 - 置信区间序列，其均匀地随时间均匀 - 用于基于I.I.D的流的完整，完全有序集中的任何分布的量级。观察。我们提供用于跟踪固定定量的方法并同时跟踪所有定量。具体而言，我们提供具有小常数的明确表达式，其宽度以尽可能快的$ \ SQRT {t} \ log \ log t} $率，以及实证分布函数的非渐近浓度不等式以相同的速率均匀地持续持续。后者加强了Smirnov迭代对数的实证过程法，延长了DVORETZKY-KIEFER-WOLFOITZ不等式以均匀地保持一段时间。我们提供了一种新的算法和样本复杂性，用于在多武装强盗框架中选择具有大约最佳定量的臂。在仿真中，我们的方法需要比现有方法更少五到五十的样品。

translated by 谷歌翻译

Projected State-action Balancing Weights for Offline Reinforcement Learning

Jiayi Wang , Zhengling Qi , Raymond K. W. Wong

分类：机器学习

2021-09-10

离线政策评估（OPE）被认为是强化学习（RL）的基本且具有挑战性的问题。本文重点介绍了基于从无限 - 马尔可夫决策过程的框架下从可能不同策略生成的预收集的数据的目标策略的价值估计。由RL最近开发的边际重要性采样方法和因果推理中的协变量平衡思想的动机，我们提出了一个新颖的估计器，具有大约投影的国家行动平衡权重，以进行策略价值估计。我们获得了这些权重的收敛速率，并表明拟议的值估计量在技术条件下是半参数有效的。就渐近学而言，我们的结果比例均以每个轨迹的轨迹数量和决策点的数量进行扩展。因此，当决策点数量分歧时，仍然可以使用有限的受试者实现一致性。此外，我们开发了一个必要且充分的条件，以建立贝尔曼操作员在政策环境中的适当性，这表征了OPE的困难，并且可能具有独立的利益。数值实验证明了我们提出的估计量的有希望的性能。

translated by 谷歌翻译

Policy evaluation from a single path: Multi-step methods, mixing and mis-specification

Yaqi Duan , Martin J. Wainwright

分类： (统计)机器学习 | 机器学习

2022-11-07

We study non-parametric estimation of the value function of an infinite-horizon $\gamma$-discounted Markov reward process (MRP) using observations from a single trajectory. We provide non-asymptotic guarantees for a general family of kernel-based multi-step temporal difference (TD) estimates, including canonical $K$-step look-ahead TD for $K = 1, 2, \ldots$ and the TD$(\lambda)$ family for $\lambda \in [0,1)$ as special cases. Our bounds capture its dependence on Bellman fluctuations, mixing time of the Markov chain, any mis-specification in the model, as well as the choice of weight function defining the estimator itself, and reveal some delicate interactions between mixing time and model mis-specification. For a given TD method applied to a well-specified model, its statistical error under trajectory data is similar to that of i.i.d. sample transition pairs, whereas under mis-specification, temporal dependence in data inflates the statistical error. However, any such deterioration can be mitigated by increased look-ahead. We complement our upper bounds by proving minimax lower bounds that establish optimality of TD-based methods with appropriately chosen look-ahead and weighting, and reveal some fundamental differences between value function estimation and ordinary non-parametric regression.

translated by 谷歌翻译

Online Statistical Inference for Matrix Contextual Bandit

Qiyu Han , Will Wei Sun , Yichen Zhang

分类： (统计)机器学习 | 机器学习

2022-12-21

Contextual bandit has been widely used for sequential decision-making based on the current contextual information and historical feedback data. In modern applications, such context format can be rich and can often be formulated as a matrix. Moreover, while existing bandit algorithms mainly focused on reward-maximization, less attention has been paid to the statistical inference. To fill in these gaps, in this work we consider a matrix contextual bandit framework where the true model parameter is a low-rank matrix, and propose a fully online procedure to simultaneously make sequential decision-making and conduct statistical inference. The low-rank structure of the model parameter and the adaptivity nature of the data collection process makes this difficult: standard low-rank estimators are not fully online and are biased, while existing inference approaches in bandit algorithms fail to account for the low-rankness and are also biased. To address these, we introduce a new online doubly-debiasing inference procedure to simultaneously handle both sources of bias. In theory, we establish the asymptotic normality of the proposed online doubly-debiased estimator and prove the validity of the constructed confidence interval. Our inference results are built upon a newly developed low-rank stochastic gradient descent estimator and its non-asymptotic convergence result, which is also of independent interest.

translated by 谷歌翻译

Alternating minimization for generalized rank one matrix sensing: Sharp predictions from a random initialization

Kabir Aladin Chandrasekher , Mengqi Lou , Ashwin Pananjady

分类： (统计)机器学习

2022-07-20

我们考虑估计与I.I.D的排名$ 1 $矩阵因素的问题。高斯，排名$ 1 $的测量值，这些测量值非线性转化和损坏。考虑到非线性的两种典型选择，我们研究了从随机初始化开始的此非convex优化问题的天然交流更新规则的收敛性能。我们通过得出确定性递归，即使在高维问题中也是准确的，我们显示出算法的样本分割版本的敏锐收敛保证。值得注意的是，虽然无限样本的种群更新是非信息性的，并提示单个步骤中的精确恢复，但算法 - 我们的确定性预测 - 从随机初始化中迅速地收敛。我们尖锐的非反应分析也暴露了此问题的其他几种细粒度，包括非线性和噪声水平如何影响收敛行为。从技术层面上讲，我们的结果可以通过证明我们的确定性递归可以通过我们的确定性顺序来预测我们的确定性序列，而当每次迭代都以$ n $观测来运行时，我们的确定性顺序可以通过$ n^{ - 1/2} $的波动。我们的技术利用了源自有关高维$ m $估计文献的遗留工具，并为通过随机数据的其他高维优化问题的随机初始化而彻底地分析了高阶迭代算法的途径。

translated by 谷歌翻译

Optimal No-regret Learning in Repeated First-price Auctions

Yanjun Han , Zhengyuan Zhou , Tsachy Weissman

分类：机器学习 | (统计)机器学习

2020-03-22

我们通过审查反馈重复进行一定的第一价格拍卖来研究在线学习，在每次拍卖结束时，出价者只观察获胜的出价，学会了适应性地出价，以最大程度地提高她的累积回报。为了实现这一目标，投标人面临着一个具有挑战性的困境：如果她赢得了竞标 - 获得正收益的唯一方法 - 然后她无法观察其他竞标者的最高竞标，我们认为我们认为这是从中汲取的。一个未知的分布。尽管这一困境让人联想到上下文强盗中的探索探索折衷权，但现有的UCB或汤普森采样算法无法直接解决。在本文中，通过利用第一价格拍卖的结构属性，我们开发了第一个实现$ o（\ sqrt {t} \ log^{2.5} t）$ hearry bund的第一个学习算法（\ sqrt {t} \ log^{2.5} t），这是最小值的最低$ $ \ log $因素，当投标人的私人价值随机生成时。我们这样做是通过在一系列问题上提供算法，称为部分有序的上下文匪徒，该算法将图形反馈跨动作，跨环境跨上下文进行结合，以及在上下文中的部分顺序。我们通过表现出一个奇怪的分离来确定该框架的优势和劣势，即在随机环境下几乎可以独立于动作/背景规模的遗憾，但是在对抗性环境下是不可能的。尽管这一通用框架有限制，但我们进一步利用了第一价格拍卖的结构，并开发了一种学习算法，该算法在存在对手生成的私有价值的情况下，在存在的情况下可以有效地运行样本（并有效地计算）。我们建立了一个$ o（\ sqrt {t} \ log^3 t）$遗憾，以此为此算法，因此提供了对第一价格拍卖的最佳学习保证的完整表征。

translated by 谷歌翻译

Debiasing In-Sample Policy Performance for Small-Data, Large-Scale Optimization

Vishal Gupta , Michael Huang , Paat Rusmevichientong

分类：机器学习 | (统计)机器学习

2021-07-26

由于在数据稀缺的设置中，交叉验证的性能不佳，我们提出了一个新颖的估计器，以估计数据驱动的优化策略的样本外部性能。我们的方法利用优化问题的灵敏度分析来估计梯度关于数据中噪声量的最佳客观值，并利用估计的梯度将策略的样本中的表现为依据。与交叉验证技术不同，我们的方法避免了为测试集牺牲数据，在训练和因此非常适合数据稀缺的设置时使用所有数据。我们证明了我们估计量的偏见和方差范围，这些问题与不确定的线性目标优化问题，但已知的，可能是非凸的，可行的区域。对于更专业的优化问题，从某种意义上说，可行区域“弱耦合”，我们证明结果更强。具体而言，我们在估算器的错误上提供明确的高概率界限，该估计器在策略类别上均匀地保持，并取决于问题的维度和策略类的复杂性。我们的边界表明，在轻度条件下，随着优化问题的尺寸的增长，我们的估计器的误差也会消失，即使可用数据的量仍然很小且恒定。说不同的是，我们证明我们的估计量在小型数据中的大规模政权中表现良好。最后，我们通过数值将我们提出的方法与最先进的方法进行比较，通过使用真实数据调度紧急医疗响应服务的案例研究。我们的方法提供了更准确的样本外部性能估计，并学习了表现更好的政策。

translated by 谷歌翻译