Smart Paper Notes

Low Variance Off-policy Evaluation with State-based Importance Sampling

David M. Bossens, Philip Thomas

Categories: Machine Learning | Artificial Intelligence

2022-12-07

In off-policy reinforcement learning, a behaviour policy performs exploratory interactions with the environment to obtain state-action-reward samples, which are then used to learn a target policy that optimises the expected return. This leads to the problem of off-policy evaluation, where one needs to evaluate the target policy from samples collected by the often unrelated behaviour policy. Importance sampling is a traditional statistical technique that is often applied to off-policy evaluation. While importance sampling estimators are unbiased, their variance increases exponentially with the horizon of the decision process because the importance weight is computed as a product of action probability ratios, yielding estimates with low accuracy for domains involving long-term planning. This paper proposes state-based importance sampling (SIS), which drops the action probability ratios of sub-trajectories with "negligible states" (roughly speaking, those for which the chosen actions have no impact on the return estimate) from the computation of the importance weight. Theoretical results show that this reduces the exponent in the variance upper bound and improves the mean squared error (MSE). An automated search algorithm based on covariance testing is proposed to identify a negligible state set which has minimal MSE when performing state-based importance sampling. Experiments are conducted on a lift domain, which includes "lift states" where the action has no impact on the following state and reward. The results demonstrate that, using the search algorithm, SIS yields reduced variance and improved accuracy compared to traditional importance sampling, per-decision importance sampling, and incremental importance sampling.
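The difference between the ordinary importance weight and the state-based one described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the trajectory layout (a list of `(state, action, reward)` tuples), the policy tables `pi_e` and `pi_b`, and the function names are all assumptions made for the example, and the negligible state set is taken as given rather than found by the covariance-based search.

```python
from math import prod


def is_weight(traj, pi_e, pi_b):
    """Ordinary importance weight: product of action-probability ratios
    over every step of the trajectory."""
    return prod(pi_e[s][a] / pi_b[s][a] for s, a, _ in traj)


def sis_weight(traj, pi_e, pi_b, negligible):
    """State-based variant: the same product, but steps whose state lies in
    the (assumed, user-supplied) negligible set contribute no ratio."""
    return prod(
        pi_e[s][a] / pi_b[s][a] for s, a, _ in traj if s not in negligible
    )


def estimate(trajs, weight_fn):
    """Average of weight * undiscounted return over a batch of trajectories."""
    total = sum(weight_fn(t) * sum(r for _, _, r in t) for t in trajs)
    return total / len(trajs)
```

Because each dropped ratio removes one factor from the product, the weight for a long trajectory that spends most of its time in negligible states is built from far fewer terms, which is the intuition behind the reduced exponent in the variance bound.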

A Neural Network Subgrid Model of the Early Stages of Planet Formation

Thomas Pfeil, Miles Cranmer, Shirley Ho, Philip J. Armitage, Tilman Birnstiel, Hubert Klahr

Categories: Machine Learning

2022-11-08

Planet formation is a multi-scale process in which the coagulation of $\mathrm{\mu m}$-sized dust grains in protoplanetary disks is strongly influenced by the hydrodynamic processes on scales of astronomical units ($\approx 1.5\times 10^8 \,\mathrm{km}$). Studies are therefore dependent on subgrid models to emulate the micro physics of dust coagulation on top of a large scale hydrodynamic simulation. Numerical simulations which include the relevant physical effects are complex and computationally expensive. Here, we present a fast and accurate learned effective model for dust coagulation, trained on data from high resolution numerical coagulation simulations. Our model captures details of the dust coagulation process that were so far not tractable with other dust coagulation prescriptions with similar computational efficiency.

Enforcing Delayed-Impact Fairness Guarantees

Aline Weber, Blossom Metevier, Yuriy Brun, Philip S. Thomas, Bruno Castro da Silva

Categories: Machine Learning | Artificial Intelligence

2022-08-24

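Recent research has shown that seemingly fair machine learning models, when used to inform decisions that affect people's lives or well-being (e.g., applications involving education, employment, and lending), can inadvertently increase social inequality in the long term. This is because prior fairness-aware algorithms only consider static fairness constraints, such as equal opportunity or demographic parity. However, enforcing constraints of this type may result in models that have a negative long-term impact on disadvantaged individuals and communities. We introduce ELF (Enforcing Long-term Fairness), the first classification algorithm that provides high-confidence fairness guarantees in terms of long-term, or delayed, impact. We prove that the probability that ELF returns an unfair solution is less than a user-specified tolerance and that (under mild assumptions), given sufficient training data, ELF is able to find and return a fair solution if one exists. We show experimentally that our algorithm can successfully mitigate long-term unfairness.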

Memory-Driven Text-to-Image Generation

Bowen Li, Philip H. S. Torr, Thomas Lukasiewicz

Categories: Computer Vision | Natural Language Processing | Machine Learning

2022-08-15

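We introduce a memory-driven semi-parametric approach to text-to-image generation, which is based on both parametric and non-parametric techniques. The non-parametric component is a memory bank of image features constructed from a training set of images. The parametric component is a generative adversarial network. Given a new text description at inference time, the memory bank is used to selectively retrieve image features that are provided as basic information about the target image, which enables the generator to produce realistic synthetic results. We also incorporate content information into the discriminator, together with semantic features, allowing the discriminator to make more reliable predictions. Experimental results demonstrate that the proposed memory-driven semi-parametric approach produces more realistic images than purely parametric approaches, in terms of both visual fidelity and text-image semantic consistency.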

Adaptive Rollout Length for Model-Based RL Using Model-Free Deep RL

Abhinav Bhatia, Philip S. Thomas, Shlomo Zilberstein

Categories: Machine Learning | Artificial Intelligence

2022-06-06

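Model-based reinforcement learning promises to learn an optimal policy from fewer interactions with the environment by learning an intermediate model of the environment in order to predict future interactions. When predicting a sequence of interactions, the rollout length, which limits the prediction horizon, is a critical hyperparameter, because the accuracy of the predictions diminishes in regions that are further away from real experience. As a result, with a longer rollout length, an overall worse policy is learned in the long run. Thus, the hyperparameter provides a trade-off between quality and efficiency. In this work, we frame the problem of tuning the rollout length as a meta-level sequential decision-making problem that optimizes the final policy learned by model-based reinforcement learning, given a fixed budget of environment interactions, by adapting the hyperparameter dynamically based on feedback from the learning process, such as the accuracy of the model and the remaining budget of interactions. We use model-free deep reinforcement learning to solve the meta-level decision problem and demonstrate that our approach outperforms common heuristic baselines on two well-known reinforcement learning environments.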

Edge-Compatible Reinforcement Learning for Recommendations

James E. Kostas, Philip S. Thomas, Georgios Theocharous

Categories: Machine Learning

2021-12-10

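Most reinforcement learning (RL) recommendation systems designed for edge computing must either synchronize during recommendation selection or depend on an unprincipled patchwork collection of algorithms. In this work, we build on asynchronous coagent policy gradient algorithms \citep{kostas2020asynchronous} to propose a principled solution to this problem. The class of algorithms that we propose can be distributed over the internet and run in real time. When a given edge fails to respond to a request for data with sufficient speed, this is not a problem; the algorithm is designed to function and learn in the edge setting, and network issues are part of this setting. The result is a principled, theoretically grounded RL algorithm designed to be distributed in, and to learn in, this asynchronous environment. In this work, we describe this algorithm and a proposed class of architectures in detail, and demonstrate that they work well in practice in the asynchronous setting, even as the network quality degrades.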

SOPE: Spectrum of Off-Policy Estimators

Christina J. Yuan, Yash Chandak, Stephen Giguere, Philip S. Thomas, Scott Niekum

Categories: Machine Learning

2021-11-06

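Many sequential decision-making problems are high-stakes and require off-policy evaluation (OPE) of a new policy using historical data collected under some other policy. One of the most common OPE techniques that provides unbiased estimates is trajectory-based importance sampling (IS). However, due to the high variance of trajectory IS estimates, importance sampling methods based on state-action visitation distributions (SIS) have recently been adopted. Unfortunately, while SIS often provides lower-variance estimates for long horizons, estimating the state-action distribution ratios can be challenging and can lead to biased estimates. In this paper, we present a new perspective on this bias-variance trade-off and show the existence of a spectrum of estimators whose endpoints are SIS and IS. Additionally, we establish a spectrum for doubly robust and weighted versions of these estimators. We provide empirical evidence that estimators in this spectrum can be used to trade off between the bias and variance of IS and SIS, and can achieve lower mean squared error than both IS and SIS.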

Universal Off-Policy Evaluation

Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, Philip S. Thomas

Categories: Machine Learning

2021-04-26

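When faced with sequential decision-making problems, it is often useful to be able to predict what would happen if decisions were made using a new policy. Those predictions must often be based on data collected under some previously used decision-making rule. Many previous methods enable such off-policy (or counterfactual) estimation of the expected value of a performance measure called the return. In this paper, we take the first steps towards a universal off-policy estimator (UnO), one that provides off-policy estimates and high-confidence bounds for any parameter of the return distribution. We use UnO for estimating and simultaneously bounding the mean, variance, quantiles/median, inter-quantile range, CVaR, and the entire cumulative distribution of returns. Finally, we also discuss UnO's applicability in various settings, including fully observable, partially observable (i.e., with unobserved confounders), Markovian, non-Markovian, stationary, smoothly non-stationary, and discrete distribution shifts.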
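One way to read the idea of estimating the whole return distribution off-policy is an importance-weighted empirical CDF, from which parameters such as quantiles can then be read. The sketch below is an assumption-laden illustration and not taken from the abstract: the function names, the per-trajectory importance weights `weights`, and the simple quantile rule are all hypothetical, and no confidence bounds are computed.

```python
def weighted_cdf(returns, weights, nu):
    """Importance-weighted estimate of P(return <= nu) under the new policy.

    returns[i] is the observed return of trajectory i under the old policy;
    weights[i] is its (assumed precomputed) importance weight.
    """
    n = len(returns)
    return sum(w for g, w in zip(returns, weights) if g <= nu) / n


def weighted_quantile(returns, weights, alpha):
    """Smallest observed return whose estimated CDF value reaches alpha."""
    for g in sorted(returns):
        if weighted_cdf(returns, weights, g) >= alpha:
            return g
    # The weighted CDF need not reach alpha with finite data; fall back to
    # the largest observed return in that case.
    return max(returns)
```

With all weights equal to 1 this reduces to the ordinary empirical CDF; unequal weights shift probability mass towards the returns that the new policy would be more likely to produce.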

Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning

Philip S. Thomas, Emma Brunskill

Categories:

2016-04-04

In this paper we present a new way of predicting the performance of a reinforcement learning policy given historical data that may have been generated by a different policy. The ability to evaluate a policy from historical data is important for applications where the deployment of a bad policy can be dangerous or costly. We show empirically that our algorithm produces estimates that often have orders of magnitude lower mean squared error than existing methods; that is, it makes more efficient use of the available data. Our new estimator is based on two advances: an extension of the doubly robust estimator (Jiang & Li, 2015), and a new way to mix between model-based estimates and importance-sampling-based estimates.
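For orientation, the doubly robust estimator that this abstract builds on combines a model's value predictions with importance-weighted corrections. The sketch below shows the standard recursive form for a single trajectory; it is not the paper's extended estimator or its mixing scheme. The data layout, the lookup tables `q_hat` and `v_hat`, and the per-step ratios `rho` are assumptions made for illustration.

```python
def doubly_robust(traj, rho, q_hat, v_hat, gamma=1.0):
    """Recursive doubly robust estimate for one trajectory.

    traj  : list of (state, action, reward) tuples
    rho   : per-step ratios pi_e(a|s) / pi_b(a|s), same length as traj
    q_hat : dict mapping (state, action) to the model's action-value estimate
    v_hat : dict mapping state to the model's state-value estimate
    """
    v_dr = 0.0  # value after the final step
    # Work backwards so each step can reuse the estimate for the next step.
    for (s, a, r), w in zip(reversed(traj), reversed(rho)):
        v_dr = v_hat[s] + w * (r + gamma * v_dr - q_hat[(s, a)])
    return v_dr
```

If the model predicts zero everywhere, the recursion collapses to per-decision importance sampling; if the model is exact on the observed transitions, the correction terms vanish and the estimate equals the model's value for the start state, whatever the ratios are.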

A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation

Nikolaus Mayer, Eddy Ilg, Philip Häusser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, Thomas Brox

Categories:

2015-12-07

Recent work has shown that optical flow estimation can be formulated as a supervised learning task and can be successfully solved with convolutional networks. Training of the so-called FlowNet was enabled by a large synthetically generated dataset. The present paper extends the concept of optical flow estimation via convolutional networks to disparity and scene flow estimation. To this end, we propose three synthetic stereo video datasets with sufficient realism, variation, and size to successfully train large networks. Our datasets are the first large-scale datasets to enable training and evaluating scene flow methods. Besides the datasets, we present a convolutional network for real-time disparity estimation that provides state-of-the-art results. By combining a flow and disparity estimation network and training it jointly, we demonstrate the first scene flow estimation with a convolutional network.
