We consider the classical multi-armed bandit problem with Markovian rewards. When an arm is played, it changes its state in a Markovian fashion; when not played, its state remains frozen. The player receives a state-dependent reward each time it plays an arm. The number of states and the state transition probabilities of an arm are unknown to the player. The player's objective is to maximize its long-term total reward by learning the best arm over time. We show that under certain conditions on the state transition probabilities of the arms, a sample-mean-based index policy achieves logarithmic regret uniformly over the total number of trials. The result shows that sample-mean-based index policies can be applied to learning problems under the rested Markovian bandit model without loss of optimality in the order. Moreover, a comparison between Anantharam's index policy and UCB shows that, by choosing a small exploration parameter, UCB can achieve a smaller regret than Anantharam's index policy.
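To make the sample-mean-based index concrete, the sketch below implements a UCB-style rule with a tunable exploration constant L (a small L corresponds to the small exploration parameter mentioned above). The `pull_arm` interface and the constants are illustrative assumptions, not the exact policies compared in the paper.

```python
# Minimal sketch of a sample-mean-based index (UCB-style) policy with a tunable
# exploration constant L; interface and constants are illustrative assumptions.
import numpy as np

def ucb_index_policy(pull_arm, num_arms, horizon, L=2.0):
    """pull_arm(i) returns the next reward of arm i; in the rested model the
    arm's Markov chain advances only when the arm is actually played."""
    counts = np.zeros(num_arms)
    sums = np.zeros(num_arms)
    for i in range(num_arms):           # play every arm once to initialize
        sums[i] += pull_arm(i)
        counts[i] += 1
    for t in range(num_arms, horizon):
        means = sums / counts
        bonus = np.sqrt(L * np.log(t + 1) / counts)   # smaller L => less exploration
        arm = int(np.argmax(means + bonus))
        sums[arm] += pull_arm(arm)
        counts[arm] += 1
    return sums / counts                # empirical mean reward of each arm
```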
We study a Bayesian multi-armed bandit (MAB) setting in which a principal seeks to maximize the sum of expected time-discounted rewards obtained by pulling arms, when the arms are actually pulled by selfish and myopic individuals. Since such individuals pull the arm with highest expected posterior reward (i.e., they always exploit and never explore), the principal must incentivize them to explore by offering suitable payments. Among others, this setting models crowdsourced information discovery and funding agencies incentivizing scientists to perform high-risk, high-reward research. We explore the tradeoff between the principal's total expected time-discounted incentive payments, and the total time-discounted rewards realized. Specifically, with a time-discount factor γ ∈ (0, 1), let OPT denote the total expected time-discounted reward achievable by a principal who pulls arms directly in a MAB problem, without having to incentivize selfish agents. We call a pair (ρ, b) ∈ [0, 1]^2 consisting of a reward ρ and payment b achievable if for every MAB instance, using expected time-discounted payments of at most b·OPT, the principal can guarantee an expected time-discounted reward of at least ρ·OPT. Our main result is an essentially complete characterization of achievable (payment, reward) pairs: if √b + √(1 − ρ) > √γ, then (ρ, b) is achievable, and if √b + √(1 − ρ) < √γ, then (ρ, b) is not achievable. In proving this characterization, we analyze so-called time-expanded policies, which in each step let the agents choose myopically with some probability p, and incentivize them to choose "optimally" with probability 1 − p. The analysis of time-expanded policies leads to a question that may be of independent interest: if the same MAB instance (without selfish agents) is considered under two different time-discount rates γ > η, how small can the ratio of OPT_η to OPT_γ be? We give a complete answer to this question, showing that OPT_η ≥ ((1 − γ)^2 / (1 − η)^2) · OPT_γ, and that this bound is tight.
In this paper we consider the stochastic multi-armed bandit with metric switching costs. Given a set of locations (arms) in a metric space, prior information about the reward available at these locations, the cost of getting a sample/play at every location, and rules to update the prior based on samples/plays, the task is to maximize a certain objective function subject to a distance cost of L and a cost of plays C. This fundamental problem models several stochastic optimization problems in robot navigation, sensor networks, labor economics, etc. In this paper we consider two natural objective functions: future utilization and finite horizon. We develop a common duality-based framework to provide the first approximation algorithms in the metric switching cost model, with approximation ratios that are small constant factors. Since both problems are Max-SNP hard, this result is the best possible. We also show an "adaptivity" result: there exists a policy which orders the arms and visits them in that fixed order without revisiting any arm, and this policy obtains at least an Ω(1) fraction of the reward of the fully adaptive policy. The overall technique involves a subtle variant of the widely used Gittins index, and the ensuing structural characterizations will be of independent interest in the context of bandit problems with complicated side-constraints. * This combines two papers [25, 26] appearing in the STOC 2007 and ICALP 2009 conferences respectively.
We formulate the following combinatorial multi-armed bandit (MAB) problem: there are random variables with unknown means, each instantiated in an i.i.d. fashion over time. At each time, multiple random variables can be selected, subject to an arbitrary constraint on weights associated with the selected variables. All of the selected individual random variables are observed at that time, and the reward obtained is a linearly weighted combination of these selected variables. The goal is to find a policy that minimizes regret, defined as the difference between the reward obtained by a genie that knows the mean of each random variable and that obtained by the given policy. This formulation is broadly applicable and useful for stochastic online versions of many interesting tasks in networks that can be formulated as tractable combinatorial optimization problems with linear objective functions, such as maximum weighted matching, shortest path, and minimum spanning tree computations. Prior work on multi-armed bandits with multiple plays cannot be applied to this formulation because of the general nature of the constraint. On the other hand, mapping all feasible combinations to arms allows the use of prior work on MABs with single play, but results in regret, storage, and computation growing exponentially in the number of unknown variables. We present new efficient policies for this problem that are shown to achieve regret that grows logarithmically with time, and polynomially in the number of unknown variables. Furthermore, these policies require only storage that grows linearly in the number of unknown parameters. For problems where the underlying deterministic problem is tractable, these policies further require only polynomial computation. For computationally intractable problems, we also present results on a different notion of regret that is suitable when a polynomial-time approximation algorithm is used. Index Terms: combinatorial network optimization, multi-armed bandits (MABs), online learning.
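To illustrate how storage stays linear in the number of unknown variables, here is a hedged sketch of a per-variable UCB index fed into a combinatorial oracle. The oracle and feedback interfaces and the exploration constant are assumptions for illustration rather than the exact policy proposed in the paper.

```python
# Hedged sketch: per-variable UCB indices plus a problem-specific combinatorial
# oracle (e.g. a matching, shortest-path, or spanning-tree solver).
import numpy as np

def combinatorial_ucb(oracle, observe, num_vars, horizon):
    """oracle(weights) -> feasible set of variable indices maximizing the
    weighted sum; observe(S) -> dict {i: realized reward of variable i}."""
    counts = np.zeros(num_vars)
    means = np.zeros(num_vars)
    total = 0.0
    for t in range(1, horizon + 1):
        # unobserved variables get a large index so the oracle is drawn to them
        bonus = np.where(counts > 0,
                         np.sqrt(2.0 * np.log(t) / np.maximum(counts, 1)), 1e6)
        selection = oracle(means + bonus)
        rewards = observe(selection)
        for i, r in rewards.items():
            counts[i] += 1
            means[i] += (r - means[i]) / counts[i]
            total += r
    return total
```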
In the stochastic knapsack problem, we are given a knapsack of size B, and a set of jobs whose sizes and rewards are drawn from a known probability distribution. However, the only way to know the actual size and reward is to schedule the job: when it completes, we get to know these values. How should we schedule jobs to maximize the expected total reward? We know constant-factor approximations for this problem when we assume that rewards and sizes are independent random variables, and that we cannot prematurely cancel jobs after we schedule them. What can we say when either or both of these assumptions are changed? The stochastic knapsack problem is of interest in its own right, but techniques developed for it are applicable to other stochastic packing problems. Indeed, ideas for this problem have been useful for budgeted learning problems, where one is given several arms which evolve in a specified stochastic fashion with each pull, and the goal is to pull the arms a total of B times to maximize the reward obtained. Much recent work on this problem focuses on the case when the evolution of the arms follows a martingale, i.e., when the expected reward from the future is the same as the reward at the current state. What can we say when the rewards do not form a martingale? In this paper, we give constant-factor approximation algorithms for the stochastic knapsack problem with correlations and/or cancellations, and also for budgeted learning problems where the martingale condition is not satisfied, using similar ideas. Indeed, we can show that previously proposed linear programming relaxations for these problems have large integrality gaps. We propose new time-indexed LP relaxations; using a decomposition and "gap-filling" approach, we convert these fractional solutions to distributions over strategies, and then use the LP values and the time-ordering information from these strategies to devise a randomized adaptive scheduling algorithm. We hope our LP formulation and decomposition methods may provide a new way to address other correlated bandit problems with more general contexts.
We consider the problem of distributed online learning with multiple players in multi-armed bandit (MAB) models. Each player can pick among multiple arms. When a player picks an arm, it gets a reward. We consider both an i.i.d. reward model and a Markovian reward model. In the i.i.d. model, each arm is modelled as an i.i.d. process with an unknown distribution and hence an unknown mean. In the Markovian model, each arm is modelled as a finite, irreducible, aperiodic and reversible Markov chain with an unknown probability transition matrix and stationary distribution. The arms give different rewards to different players. If two players pick the same arm, there is a "collision", and neither of them gets any reward. There is no dedicated control channel for coordination or communication among the players. Any other communication between the players is costly and adds to the regret. We propose an online index-based distributed learning policy called the dUCB4 algorithm that trades off exploration vs. exploitation in the right way, and achieves expected regret that grows at most as near-O(log² T). The motivation comes from opportunistic spectrum access by multiple secondary users in cognitive radio networks, wherein they must pick among various wireless channels that look different to different users. To the best of our knowledge, this is the first distributed learning algorithm for multi-player MABs. Index Terms: distributed adaptive control, multi-armed bandit, online learning, multi-agent systems.
We incorporate statistical confidence intervals in both the multi-armed bandit and the reinforcement learning problems. In the bandit problem we show that given n arms, it suffices to pull the arms a total of O((n/ε²) log(1/δ)) times to find an ε-optimal arm with probability of at least 1 − δ. This bound matches the lower bound of Mannor and Tsitsiklis (2004) up to constants. We also devise action elimination procedures in reinforcement learning algorithms. We describe a framework that is based on learning the confidence interval around the value function or the Q-function and eliminating actions that are not optimal (with high probability). We provide model-based and model-free variants of the elimination method. We further derive stopping conditions guaranteeing that the learned policy is approximately optimal with high probability. Simulations demonstrate a considerable speedup and added robustness over ε-greedy Q-learning.
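As a concrete illustration of the action-elimination idea, the sketch below keeps a Hoeffding-style confidence interval per arm and drops arms whose upper bound falls below the best arm's lower bound. The sampling schedule and constants differ from the paper's exact procedure, so this is a hedged sketch only.

```python
# Hedged sketch of action elimination for finding an epsilon-optimal arm.
import math

def action_elimination(pull_arm, num_arms, epsilon, delta):
    """pull_arm(i) returns a reward in [0, 1]; returns an arm that is
    epsilon-optimal with probability at least 1 - delta under the union
    bound used for the confidence radius below."""
    active = set(range(num_arms))
    means = {i: 0.0 for i in active}
    rounds = 0
    while True:
        rounds += 1
        for i in active:                       # one sample per active arm per round
            means[i] += (pull_arm(i) - means[i]) / rounds
        radius = math.sqrt(math.log(4.0 * num_arms * rounds ** 2 / delta)
                           / (2.0 * rounds))
        best = max(active, key=means.get)
        if 2.0 * radius <= epsilon or len(active) == 1:
            return best                        # empirical best is eps-optimal w.h.p.
        # drop arms whose upper confidence bound is below the best arm's lower bound
        active = {i for i in active
                  if means[i] + radius >= means[best] - radius}
```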
We study the problem of learning near-optimal behavior in finite Markov Decision Processes (MDPs) with a polynomial number of samples. These "PAC-MDP" algorithms include the well-known E³ and R-MAX algorithms as well as the more recent Delayed Q-learning algorithm. We summarize the current state of the art by presenting bounds for the problem in a unified theoretical framework. A more refined analysis for upper and lower bounds is presented to yield insight into the differences between the model-free Delayed Q-learning and the model-based R-MAX.
We consider the problem of learning in single-player and multiplayer multi-armed bandit models. Bandit problems are classes of online learning problems that capture exploration versus exploitation tradeoffs. In a multi-armed bandit model, players can pick among many arms, and each play of an arm generates an i.i.d. reward from an unknown distribution. The objective is to design a policy that maximizes the expected reward over a time horizon in the single-player setting, and the sum of expected rewards in the multiplayer setting. In the multiplayer setting, arms may give different rewards to different players. There is no separate channel for coordination among the players. Any attempt at communication is costly and adds to regret. We propose two decentralizable policies, E³ (E-cubed) and E³-TS, that can be used in both single-player and multiplayer settings. These policies are shown to yield expected regret that grows at most as O(log^{1+δ} T) (and O(log T) under some assumption). It is well known that O(log T) is the lower bound on the rate of growth of regret even in the centralized case. The proposed algorithms improve on prior work where regret grew as O(log² T). More fundamentally, these policies address the question of the additional cost incurred in decentralized online learning, suggesting that there is at most a δ-factor cost in terms of the order of regret. This resolves a question of relevance in many domains that had been open for a while.
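The policies are only described at a high level above; the sketch below shows one hedged way to realize a deterministic exploration/exploitation epoch structure whose total exploration over a horizon T grows roughly as log^{1+δ} T (exploration blocks of about k^δ plays per arm, exploitation blocks of length 2^k). The schedule and constants are assumptions, not the exact E³ or E³-TS policies.

```python
# Hedged sketch of deterministic exploration/exploitation epochs.
import numpy as np

def deterministic_epochs(pull_arm, num_arms, horizon, delta=0.1):
    counts = np.zeros(num_arms)
    means = np.zeros(num_arms)
    t, k = 0, 1

    def play(i):
        nonlocal t
        r = pull_arm(i)
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]
        t += 1

    while t < horizon:
        # exploration block: round-robin over all arms, ~k^delta plays per arm
        for _ in range(int(np.ceil(k ** delta))):
            for i in range(num_arms):
                if t >= horizon:
                    return means
                play(i)
        # exploitation block: commit to the empirical best arm for 2^k slots
        best = int(np.argmax(means))
        for _ in range(2 ** k):
            if t >= horizon:
                return means
            play(best)
        k += 1
    return means
```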
We present a formal model of human decision-making in explore-exploit tasks using the context of multi-armed bandit problems, where the decision-maker must choose among multiple options with uncertain rewards. We address the standard multi-armed bandit problem, the multi-armed bandit problem with transition costs, and the multi-armed bandit problem on graphs. We focus on the case of Gaussian rewards in a setting where the decision-maker uses Bayesian inference to estimate the reward values. We model the decision-maker's prior knowledge with a Bayesian prior on the mean reward. We develop the upper credible limit (UCL) algorithm for the standard multi-armed bandit problem and show that this deterministic algorithm achieves logarithmic cumulative expected regret, which is optimal performance for uninformative priors. We show how good priors and good assumptions on the correlation structure among arms can greatly enhance decision-making performance, even over short time horizons. We extend this to the stochastic UCL algorithm and draw several connections to human decision-making behavior. We present empirical data from human experiments and show that human performance is efficiently captured by the stochastic UCL algorithm with appropriate parameters. For the multi-armed bandit problem with transition costs and the multi-armed bandit problem on graphs, we generalize the UCL algorithm to the block UCL algorithm and the graphical block UCL algorithm, respectively. We show that these algorithms also achieve logarithmic cumulative expected regret and require a sub-logarithmic expected number of transitions among arms. We further illustrate the performance of these algorithms with numerical examples. NB: Appendix G, included in this version, details minor modifications that correct for an oversight in the previously-published proofs. The remainder of the text reflects the published work.
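For concreteness, here is a hedged sketch of a UCL-style rule with conjugate Gaussian priors and a shrinking tail probability. The quantile schedule 1 − 1/(Kt) and the default choice of K are illustrative assumptions rather than the exact algorithm and constants analyzed in the paper.

```python
# Hedged sketch of an upper-credible-limit (UCL) rule with Gaussian conjugate
# priors on each arm's mean and a known sampling variance.
import numpy as np
from scipy.stats import norm

def ucl(pull_arm, mu0, sigma0_sq, sigma_sq, horizon, K=np.e):
    """mu0, sigma0_sq: per-arm prior means/variances; sigma_sq: reward variance."""
    prec = 1.0 / np.asarray(sigma0_sq, dtype=float)   # posterior precisions
    mu = np.asarray(mu0, dtype=float)                 # posterior means
    for t in range(1, horizon + 1):
        alpha = 1.0 / (K * t)                         # shrinking tail probability
        ucl_values = mu + norm.ppf(1 - alpha) * np.sqrt(1.0 / prec)
        arm = int(np.argmax(ucl_values))
        r = pull_arm(arm)
        # Gaussian conjugate update of the chosen arm's posterior
        new_prec = prec[arm] + 1.0 / sigma_sq
        mu[arm] = (prec[arm] * mu[arm] + r / sigma_sq) / new_prec
        prec[arm] = new_prec
    return mu
```

Informative priors enter simply through mu0 and sigma0_sq: a confident, accurate prior (small sigma0_sq near the true mean) shrinks the credible intervals from the start, which is the mechanism behind the improved short-horizon performance discussed above.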
We study the application of the Thompson sampling (TS) methodology to the stochastic combinatorial multi-armed bandit (CMAB) framework. We analyze the standard TS algorithm for general CMAB and obtain the first distribution-dependent regret bound of O(m log T / Δ_min) for general CMAB, where m is the number of arms, T is the time horizon, and Δ_min is the minimum gap between the expected reward of the optimal solution and that of any non-optimal solution. We also show that one cannot use an approximation oracle in the TS algorithm, even for the MAB problem. We then extend our analysis to matroid bandits, a special case of CMAB, for which we can remove the independence assumption across arms and achieve a better regret bound. Finally, we use some experiments to compare the regret of the CUCB and CTS algorithms.
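To make the setting concrete, here is a hedged sketch of combinatorial Thompson sampling with independent Bernoulli base arms and semi-bandit feedback. The oracle and feedback interfaces are assumptions; note that, per the result above, the oracle is required to be exact rather than an approximation oracle.

```python
# Hedged sketch of combinatorial Thompson sampling (CTS) with Bernoulli base arms.
import numpy as np

def cts(oracle, observe, num_arms, horizon, rng=None):
    """oracle(theta) -> feasible subset (list of base-arm indices) maximizing the
    sampled linear reward; observe(S) -> {i: 0/1 outcome} for each i in S."""
    rng = np.random.default_rng() if rng is None else rng
    a = np.ones(num_arms)   # Beta(1, 1) priors
    b = np.ones(num_arms)
    for _ in range(horizon):
        theta = rng.beta(a, b)          # one posterior sample per base arm
        S = oracle(theta)               # exact oracle on the sampled means
        outcomes = observe(S)           # semi-bandit feedback on the chosen arms
        for i, x in outcomes.items():
            a[i] += x
            b[i] += 1 - x
    return a / (a + b)                  # posterior mean estimates
```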
We present the first approximation algorithms for a large class of budgeted learning problems. One classic example is the budgeted multi-armed bandit problem. In this problem each arm of the bandit has an unknown reward distribution on which a prior is specified as input. The knowledge about the underlying distribution can be refined in the exploration phase by playing the arm and observing the rewards. However, there is a budget on the total number of plays allowed during exploration. After this exploration phase, the arm with the highest (posterior) expected reward is chosen for exploitation. The goal is to design the adaptive exploration phase subject to a budget constraint on the number of plays, in order to maximize the expected reward of the arm chosen for exploitation. While this problem is reasonably well understood in the infinite-horizon setting and in terms of regret bounds, the budgeted version of the problem is NP-Hard. For this problem, and several generalizations, we provide approximate policies that achieve a reward within a constant factor of that of the optimal policy. Our algorithms use a novel linear program rounding technique based on stochastic packing.
Sequential decision problems are often approximately solvable by simulating possible future action sequences. Metalevel decision procedures have been developed for selecting which action sequences to simulate, based on estimating the expected improvement in decision quality that would result from any particular simulation; an example is the recent work on using bandit algorithms to control Monte Carlo tree search in the game of Go. In this paper we develop a theoretical basis for metalevel decisions in the statistical framework of Bayesian selection problems, arguing (as others have done) that this is more appropriate than the bandit framework. We derive a number of basic results applicable to Monte Carlo selection problems, including the first finite sampling bounds for optimal policies in certain cases; we also provide a simple counterexample to the intuitive conjecture that an optimal policy will necessarily reach a decision in all cases. We then derive heuristic approximations in both Bayesian and distribution-free settings and demonstrate their superiority to bandit-based heuristics in one-shot decision problems and in Go.
Multi-armed bandit problems (MABPs) are a special type of optimal control problem well suited to model resource allocation under uncertainty in a wide variety of contexts. Since the first publication of the optimal solution of the classic MABP by a dynamic index rule, the bandit literature quickly diversified and emerged as an active research topic. Across this literature, the use of bandit models to optimally design clinical trials became a typical motivating application, yet little of the resulting theory has ever been used in the actual design and analysis of clinical trials. To this end, we review two MABP decision-theoretic approaches to the optimal allocation of treatments in a clinical trial: the infinite-horizon Bayesian Bernoulli MABP and the finite-horizon variant. These models possess distinct theoretical properties and lead to separate allocation rules in a clinical trial design context. We evaluate their performance compared to other allocation rules, including fixed randomization. Our results indicate that bandit approaches offer significant advantages, in terms of assigning more patients to better treatments, and severe limitations, in terms of their resulting statistical power. We propose a novel bandit-based patient allocation rule that overcomes the issue of low power, thus removing a potential barrier for their use in practice.
Inspired by the success of AlphaGo Zero (AGZ), which utilizes Monte Carlo Tree Search (MCTS) with supervised learning via neural networks to learn the optimal policy and value function, in this work we focus on formally establishing that such an approach indeed finds an asymptotically optimal policy, as well as establishing non-asymptotic guarantees along the way. We focus on infinite-horizon discounted Markov decision processes to establish the results. To start with, this requires establishing a property claimed for MCTS in the literature: for any given query state, MCTS provides an approximate value function for that state with enough simulated MDP steps. We provide a non-asymptotic analysis establishing this property by analyzing a non-stationary multi-armed bandit setup. Our proofs suggest that MCTS needs to utilize a polynomial rather than logarithmic "upper confidence bound" to establish its desired performance; interestingly, AGZ chooses such a polynomial bound. Using this as a building block, combined with nearest-neighbor supervised learning, we argue that MCTS acts as a "policy improvement" operator; it has a natural "bootstrapping" property that iteratively improves the value function approximation for all states, thanks to the combination with supervised learning, despite evaluating at only finitely many states. In effect, we establish that to learn an ε-approximation of the value function with respect to the ℓ∞ norm, MCTS combined with nearest-neighbor supervised learning requires a number of samples scaling as Õ(ε^{-(d+4)}), where d is the dimension of the state space. This is nearly optimal in view of a minimax lower bound of Ω̃(ε^{-(d+2)}).
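The point about polynomial versus logarithmic confidence bounds can be illustrated with a simplified tree-search sketch. Everything below (the MDP interface, rollout depth, and the constants α, β, c) is an illustrative assumption rather than the analyzed algorithm.

```python
# Hedged sketch: tree search whose selection rule uses a polynomial bonus
# ~ t^alpha / n^beta instead of UCT's logarithmic sqrt(log t / n) bonus.
from collections import defaultdict

def mcts_value(step, actions, state, depth, num_sims, gamma=0.95,
               alpha=0.5, beta=0.25, c=1.0):
    """step(s, a) -> (next_state, reward); states must be hashable.
    Returns a Monte Carlo estimate of the value of `state`."""
    N = defaultdict(int)      # visit counts per (state, action)
    Q = defaultdict(float)    # running mean of returns per (state, action)

    def simulate(s, d):
        if d == 0:
            return 0.0
        t = 1 + sum(N[(s, a)] for a in actions)
        # polynomial exploration bonus, in contrast to UCT's logarithmic one
        scores = [Q[(s, a)] + c * t ** alpha / N[(s, a)] ** beta
                  if N[(s, a)] > 0 else float("inf") for a in actions]
        a = actions[scores.index(max(scores))]
        s2, r = step(s, a)
        ret = r + gamma * simulate(s2, d - 1)
        N[(s, a)] += 1
        Q[(s, a)] += (ret - Q[(s, a)]) / N[(s, a)]
        return ret

    for _ in range(num_sims):
        simulate(state, depth)
    return max(Q[(state, a)] for a in actions)
```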
Chance-constrained Markov decision processes (CCMDPs) maximize reward subject to a bounded probability of failure, and are frequently applied to planning with potentially hazardous outcomes or unknown environments. Solution algorithms have required strong heuristics or have been limited to relatively small problems with up to millions of states, because the optimal action from a given state depends on the probability of failure in the rest of the policy, leading to a coupled problem that is hard to solve. In this paper, we study a generalization of the CCMDP that trades off the probability of failure against reward through a functional relationship. We derive a constraint that can be applied to each state history in a policy individually and that guarantees the chance constraint will be satisfied. The approach decouples states in the CCMDP, so that large problems can be solved efficiently. We then introduce Vulcan, which uses our constraint to apply Monte Carlo tree search to CCMDPs. Vulcan can be applied to problems for which generating the entire state space is infeasible and a policy must be returned in an anytime manner. We show that Vulcan and its variants run tens to hundreds of times faster than linear programming methods and more than ten times faster than heuristic-based methods, all without the need for a heuristic, while returning solutions with a mean suboptimality of a few percent. Finally, we use Vulcan to solve for a chance-constrained policy in a CCMDP with over 10^{13} states in 3 minutes.
In this paper, we consider reinforcement learning for Markov decision processes (MDPs) with peak constraints, where the agent chooses a policy to optimize an objective while simultaneously satisfying additional constraints. The agent must act based on the observed states, reward outputs, and constraint outputs, without knowledge of the dynamics, the reward functions, and/or the constraint functions. We introduce a game-theoretic approach to constructing reinforcement learning algorithms, in which the agent maximizes an unconstrained objective that depends on the simulated actions of a minimizing opponent acting on a finite set of actions and on the output data (rewards) of the constraint functions. We prove that the policies obtained from maximin Q-learning converge to the optimal policies. To the best of our knowledge, this is the first learning algorithm that guarantees convergence to an optimal stationary policy for MDP problems with peak constraints, for both discounted and expected average rewards.
We consider a novel multi-armed bandit framework in which the rewards obtained by pulling the arms are functions of a common latent random variable. The correlation between arms due to the common source of randomness can be used to design a generalized upper confidence bound (UCB) algorithm that identifies certain arms as non-competitive and avoids exploring them. As a result, we reduce a K-armed bandit problem to a (C+1)-armed problem, where the C+1 arms comprise the best arm and C competitive arms. Our regret analysis shows that the competitive arms need to be pulled O(log T) times, while the non-competitive arms are pulled only O(1) times. As a result, there are regimes where our algorithm achieves an O(1) regret, in contrast to the typical logarithmic regret scaling of multi-armed bandit algorithms. We also evaluate lower bounds on the expected regret and prove that our correlated UCB algorithm is order-wise optimal.
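A hedged sketch of the "competitive set" idea is given below: known pseudo-reward bounds are averaged over the empirical best arm's samples to flag arms as non-competitive, and ordinary UCB is run over the rest. The interfaces and the specific competitiveness test are simplified assumptions rather than the paper's exact rule.

```python
# Hedged sketch: prune non-competitive arms via pseudo-rewards, then run UCB.
import numpy as np

def correlated_ucb(pull_arm, pseudo, num_arms, horizon):
    """pseudo(l, k, r): known upper bound on arm l's expected reward given that
    arm k yielded reward r (the correlation side information)."""
    samples = [[pull_arm(i)] for i in range(num_arms)]   # one initial pull each
    for t in range(num_arms, horizon):
        means = np.array([np.mean(s) for s in samples])
        counts = np.array([len(s) for s in samples])
        best = int(np.argmax(means))
        # empirical pseudo-reward of every arm with respect to the empirical best
        pseudo_means = [means[best] if l == best else
                        np.mean([pseudo(l, best, r) for r in samples[best]])
                        for l in range(num_arms)]
        # arms whose pseudo-reward falls below the best arm's mean are treated
        # as non-competitive and are not explored in this round
        competitive = [l for l in range(num_arms)
                       if pseudo_means[l] >= means[best]]
        ucb = means + np.sqrt(2.0 * np.log(t + 1) / counts)
        arm = max(competitive, key=lambda l: ucb[l])
        samples[arm].append(pull_arm(arm))
    return [float(np.mean(s)) for s in samples]
```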
This paper proposes a general framework for multi-armed bandit (MAB) processes by introducing a type of restriction on switching between arms at varying times. A Gittins index process is constructed for any single arm subject to the switching restriction, and the optimality of the corresponding Gittins index rule is then established. The Gittins indices defined in this paper are consistent with those of MAB processes in continuous time, in integer time, in the semi-Markov setting, and in the general discrete-time setting, so the new theory covers the classical models as special cases and also applies to many other situations not yet covered in the literature. While the proof of optimality of the Gittins index policy can benefit from the existing theory of MAB processes in continuous time, new techniques are introduced that substantially simplify the proof.
For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: an MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). We present a reinforcement learning algorithm with total regret Õ(DS√(AT)) after T steps for any unknown MDP with S states, A actions per state, and diameter D. A corresponding lower bound of Ω(√(DSAT)) on the total regret of any learning algorithm is given as well. These results are complemented by a sample complexity bound on the number of suboptimal steps taken by our algorithm. This bound can be used to achieve a (gap-dependent) regret bound that is logarithmic in T. Finally, we also consider a setting where the MDP is allowed to change a fixed number ℓ of times. We present a modification of our algorithm that is able to deal with this setting and show a regret bound of Õ(ℓ^{1/3} T^{2/3} DS√A).
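To illustrate the kind of optimism that underlies such diameter-dependent bounds, here is a hedged sketch of extended value iteration over a confidence region around the empirical model (in the spirit of UCRL2). The confidence radii are taken as inputs and the stopping rule is simplified, so this is a sketch rather than the paper's exact algorithm.

```python
# Hedged sketch of optimistic ("extended") value iteration over confidence sets.
import numpy as np

def inner_max(p_hat, beta, u):
    """Most optimistic transition vector within an L1 ball of radius beta."""
    order = np.argsort(u)[::-1]                 # states sorted by value, best first
    p = p_hat.copy()
    p[order[0]] = min(1.0, p_hat[order[0]] + beta / 2.0)
    # remove excess mass from the worst states until probabilities sum to one
    for s in order[::-1]:
        excess = p.sum() - 1.0
        if excess <= 0:
            break
        p[s] = max(0.0, p[s] - excess)
    return p

def extended_value_iteration(p_hat, r_hat, beta_p, beta_r, eps=1e-3):
    """p_hat: (S, A, S) empirical transitions; r_hat, beta_r, beta_p: (S, A)."""
    S, A = r_hat.shape
    u = np.zeros(S)
    while True:
        q = np.array([[r_hat[s, a] + beta_r[s, a] +
                       inner_max(p_hat[s, a], beta_p[s, a], u) @ u
                       for a in range(A)] for s in range(S)])
        u_new = q.max(axis=1)
        if (u_new - u).max() - (u_new - u).min() < eps:
            return u_new, q.argmax(axis=1)      # optimistic values, greedy policy
        u = u_new - u_new.min()                 # keep the value vector bounded
```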