DOI: 10.1145/2600057.2602897

Incentivizing exploration

Published: 01 June 2014

Abstract

We study a Bayesian multi-armed bandit (MAB) setting in which a principal seeks to maximize the sum of expected time-discounted rewards obtained by pulling arms, when the arms are actually pulled by selfish and myopic individuals. Since such individuals pull the arm with the highest expected posterior reward (i.e., they always exploit and never explore), the principal must incentivize them to explore by offering suitable payments. Among other applications, this setting models crowdsourced information discovery and funding agencies incentivizing scientists to perform high-risk, high-reward research.
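
To make the setting concrete, here is a minimal Python sketch of one round, assuming two Bernoulli arms with Beta priors (the paper's model is a general Bayesian MAB; the Beta-Bernoulli instance and all names below are ours for illustration). A myopic agent pulls the arm with the highest posterior mean, so the principal must cover the posterior gap to induce any other pull.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative instance: two Bernoulli arms with Beta(1,1) priors.
    alpha = np.array([1.0, 1.0])       # posterior successes + 1
    beta = np.array([1.0, 1.0])        # posterior failures + 1
    true_means = np.array([0.6, 0.7])  # unknown to the agents

    def myopic_choice():
        # A selfish agent always exploits: it pulls the arm with the
        # highest expected posterior reward and never explores.
        return int(np.argmax(alpha / (alpha + beta)))

    def pull(arm):
        # Observe a Bernoulli reward and update the Beta posterior.
        r = float(rng.random() < true_means[arm])
        alpha[arm] += r
        beta[arm] += 1.0 - r
        return r

    def incentive_cost(arm):
        # To induce a pull of `arm` instead of the myopic choice, the
        # principal must offer a payment covering the gap in posterior
        # means -- the cost the paper trades off against realized reward.
        means = alpha / (alpha + beta)
        return max(0.0, float(means[myopic_choice()] - means[arm]))
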
We explore the tradeoff between the principal's total expected time-discounted incentive payments, and the total time-discounted rewards realized. Specifically, with a time-discount factor γ ∈ (0,1), let OPT denote the total expected time-discounted reward achievable by a principal who pulls arms directly in a MAB problem, without having to incentivize selfish agents. We call a pair (ρ,b) ∈ [0,1]² consisting of a reward ρ and payment b achievable if for every MAB instance, using expected time-discounted payments of at most b•OPT, the principal can guarantee an expected time-discounted reward of at least ρ•OPT. Our main result is an essentially complete characterization of achievable (payment, reward) pairs: if √b + √(1-ρ) > √γ, then (ρ,b) is achievable, and if √b + √(1-ρ) < √γ, then (ρ,b) is not achievable.
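
For intuition, the frontier is easy to evaluate numerically. A small sketch (the function name is ours; the boundary case √b + √(1-ρ) = √γ is the gap left open by the characterization):

    import math

    def achievable(rho, b, gamma):
        # (rho, b) is achievable if sqrt(b) + sqrt(1 - rho) > sqrt(gamma),
        # and not achievable if the inequality is strictly reversed.
        return math.sqrt(b) + math.sqrt(1.0 - rho) > math.sqrt(gamma)

    # Worked example with gamma = 0.9: a payment budget of b = 0.1 * OPT
    # permits any reward fraction rho with
    # sqrt(1 - rho) > sqrt(0.9) - sqrt(0.1) ~ 0.6325, i.e. rho < ~0.60.
    print(achievable(0.59, 0.1, 0.9))  # True
    print(achievable(0.65, 0.1, 0.9))  # False
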
In proving this characterization, we analyze so-called time-expanded policies, which in each step let the agents choose myopically with some probability p, and incentivize them to choose "optimally" with probability 1-p. The analysis of time-expanded policies leads to a question that may be of independent interest: if the same MAB instance (without selfish agents) is considered under two different time-discount factors γ > η, how small can the ratio of OPTη to OPTγ be? We give a complete answer to this question, showing that OPTη ≥ (1-γ)²/(1-η)² • OPTγ, and that this bound is tight.
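
Both ingredients admit a short sketch. Below, time_expanded_step is schematic (the paper specifies how the "optimal" arm and the accompanying payments are chosen), and discount_ratio_lower_bound evaluates the tight bound relating OPT under the two discount factors; the function names are ours.

    import random

    def time_expanded_step(p, myopic_arm, optimal_arm):
        # One step of a time-expanded policy: with probability p the
        # agents choose myopically; with probability 1 - p the principal
        # pays them enough to pull the "optimal" arm instead.
        return myopic_arm if random.random() < p else optimal_arm

    def discount_ratio_lower_bound(gamma, eta):
        # For discount factors gamma > eta, the paper shows
        #   OPT_eta >= (1 - gamma)^2 / (1 - eta)^2 * OPT_gamma,
        # and that this bound is tight.
        assert 0.0 < eta < gamma < 1.0
        return (1.0 - gamma) ** 2 / (1.0 - eta) ** 2

    # E.g., lowering the discount factor from gamma = 0.9 to eta = 0.8
    # preserves at least (0.1 / 0.2)^2 = 25% of the optimal reward.
    print(discount_ratio_lower_bound(0.9, 0.8))  # 0.25
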

Published In

EC '14: Proceedings of the fifteenth ACM conference on Economics and computation
June 2014
1028 pages
ISBN:9781450325653
DOI:10.1145/2600057

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. crowdsourcing
  2. exploration
  3. incentives
  4. multi-armed bandit problems

Qualifiers

  • Research-article

Conference

EC '14: ACM Conference on Economics and Computation
June 8-12, 2014
Palo Alto, California, USA

Acceptance Rates

EC '14 Paper Acceptance Rate: 80 of 290 submissions, 28%
Overall Acceptance Rate: 664 of 2,389 submissions, 28%

Cited By

  • (2024) Incentivized Exploration via Filtered Posterior Sampling. SSRN Electronic Journal. DOI: 10.2139/ssrn.4733191. Online publication date: 2024.
  • (2024) Clickbait vs. Quality: How Engagement-Based Optimization Shapes the Content Landscape in Online Platforms. Proceedings of the ACM Web Conference 2024, 36-45. DOI: 10.1145/3589334.3645353. Online publication date: 13-May-2024.
  • (2023) Bandit social learning under myopic behavior. Proceedings of the 37th International Conference on Neural Information Processing Systems, 10385-10411. DOI: 10.5555/3666122.3666578. Online publication date: 10-Dec-2023.
  • (2023) Incentivizing exploration with linear contexts and combinatorial actions. Proceedings of the 40th International Conference on Machine Learning, 30570-30583. DOI: 10.5555/3618408.3619675. Online publication date: 23-Jul-2023.
  • (2023) Sample Complexity of Decentralized Tabular Q-Learning for Stochastic Games. 2023 American Control Conference (ACC), 1098-1103. DOI: 10.23919/ACC55779.2023.10155822. Online publication date: 31-May-2023.
  • (2023) Incentivizing Exploration in Linear Contextual Bandits under Information Gap. Proceedings of the 17th ACM Conference on Recommender Systems, 415-425. DOI: 10.1145/3604915.3608794. Online publication date: 14-Sep-2023.
  • (2023) Learning Equilibria in Matching Markets with Bandit Feedback. Journal of the ACM 70, 3, 1-46. DOI: 10.1145/3583681. Online publication date: 24-May-2023.
  • (2023) Reward Teaching for Federated Multiarmed Bandits. IEEE Transactions on Signal Processing 71, 4407-4422. DOI: 10.1109/TSP.2023.3333658. Online publication date: 2023.
  • (2023) Budget-Constrained Cost-Covering Job Assignment for a Total Contribution-Maximizing Platform. Combinatorial Algorithms, 392-403. DOI: 10.1007/978-3-031-34347-6_33. Online publication date: 3-Jun-2023.
  • (2023) The Societal Impacts of Algorithmic Decision-Making. Online publication date: 7-Sep-2023.
