We study a Bayesian multi-armed bandit (MAB) setting in which a principal seeks to maximize the sum of expected time-discounted rewards obtained by pulling arms, while the arms are actually pulled by selfish and myopic individuals. Since such individuals pull the arm with the highest expected posterior reward (i.e., they always exploit and never explore), the principal must incentivize them to explore by offering suitable payments. Among other applications, this setting models crowdsourced information discovery and funding agencies incentivizing scientists to perform high-risk, high-reward research.

We explore the tradeoff between the principal's total expected time-discounted incentive payments and the total time-discounted rewards realized. Specifically, with a time-discount factor γ ∈ (0, 1), let OPT denote the total expected time-discounted reward achievable by a principal who pulls arms directly in a MAB problem, without having to incentivize selfish agents. We call a pair (ρ, b) ∈ [0, 1]² consisting of a reward ρ and payment b achievable if for every MAB instance, using expected time-discounted payments of at most b·OPT, the principal can guarantee an expected time-discounted reward of at least ρ·OPT. Our main result is an essentially complete characterization of achievable (payment, reward) pairs: if √b + √(1 − ρ) > √γ, then (ρ, b) is achievable, and if √b + √(1 − ρ) < √γ, then (ρ, b) is not achievable.

In proving this characterization, we analyze so-called time-expanded policies, which in each step let the agents choose myopically with some probability p, and incentivize them to choose "optimally" with probability 1 − p. The analysis of time-expanded policies leads to a question that may be of independent interest: if the same MAB instance (without selfish agents) is considered under two different time-discount rates γ > η, how small can the ratio of OPT_η to OPT_γ be?
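The single-step behavior of a time-expanded policy can be sketched as follows. This is an illustrative simplification under assumed names (`posterior_mean`, `optimal_arm`), not the paper's exact construction: with probability p the agent pulls myopically, and with probability 1 − p the principal pays enough to redirect the pull to the arm the optimal (unconstrained) MAB policy would choose.

```python
import random

def time_expanded_step(arms, posterior_mean, optimal_arm, p):
    """One step of a time-expanded policy (illustrative sketch only).

    With probability p, the selfish agent exploits: it pulls the arm with
    the highest expected posterior reward. With probability 1 - p, the
    principal incentivizes the agent to pull `optimal_arm`, the arm the
    unconstrained optimal MAB policy would pull at this step.
    """
    if random.random() < p:
        # Myopic choice: maximize current expected posterior reward.
        return max(arms, key=lambda a: posterior_mean[a])
    # Incentivized "optimal" choice, paid for by the principal.
    return optimal_arm
```

With p = 1 every pull is myopic (pure exploitation, no payments); with p = 0 every pull follows the optimal policy, at maximal payment cost.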
We give a complete answer to this question, showing that OPT_η ≥ ((1 − γ)²/(1 − η)²) · OPT_γ, and that this bound is tight.
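The two quantitative statements above are easy to evaluate numerically. The following sketch (helper names are our own) classifies a (ρ, b) pair according to the achievability characterization and computes the tight lower bound on OPT_η/OPT_γ:

```python
import math

def classify_pair(rho: float, b: float, gamma: float) -> str:
    """Classify (rho, b) in [0, 1]^2 per the characterization:
    achievable if sqrt(b) + sqrt(1 - rho) > sqrt(gamma),
    not achievable if sqrt(b) + sqrt(1 - rho) < sqrt(gamma)."""
    lhs = math.sqrt(b) + math.sqrt(1.0 - rho)
    threshold = math.sqrt(gamma)
    if lhs > threshold:
        return "achievable"
    if lhs < threshold:
        return "not achievable"
    return "boundary"

def discount_ratio_bound(gamma: float, eta: float) -> float:
    """Tight lower bound on OPT_eta / OPT_gamma for discount
    factors 1 > gamma > eta > 0."""
    assert 1.0 > gamma > eta > 0.0
    return (1.0 - gamma) ** 2 / (1.0 - eta) ** 2
```

For example, with γ = 0.9, a modest payment budget b = 0.3 suffices for reward fraction ρ = 0.5 (since √0.3 + √0.5 ≈ 1.25 > √0.9 ≈ 0.95), while for γ = 0.99 the pair (ρ, b) = (0.9, 0.05) is not achievable.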