DOI: 10.1145/2600057.2602897

Incentivizing exploration

Published: 01 June 2014

Abstract

We study a Bayesian multi-armed bandit (MAB) setting in which a principal seeks to maximize the sum of expected time-discounted rewards obtained by pulling arms, when the arms are actually pulled by selfish and myopic individuals. Since such individuals pull the arm with the highest expected posterior reward (i.e., they always exploit and never explore), the principal must incentivize them to explore by offering suitable payments. Among other applications, this setting models crowdsourced information discovery and funding agencies incentivizing scientists to perform high-risk, high-reward research.
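
To make the setting concrete, here is a minimal Python sketch of one round, assuming two Bernoulli arms with Beta priors (the paper's model is a general Bayesian MAB; the Beta-Bernoulli instance and all names below are ours for illustration). A myopic agent pulls the arm with the highest posterior mean, so the principal must cover the posterior gap to induce any other pull.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative instance: two Bernoulli arms with Beta(1,1) priors.
    alpha = np.array([1.0, 1.0])       # posterior successes + 1
    beta = np.array([1.0, 1.0])        # posterior failures + 1
    true_means = np.array([0.6, 0.7])  # unknown to the agents

    def myopic_choice():
        # A selfish agent always exploits: it pulls the arm with the
        # highest expected posterior reward and never explores.
        return int(np.argmax(alpha / (alpha + beta)))

    def pull(arm):
        # Observe a Bernoulli reward and update the Beta posterior.
        r = float(rng.random() < true_means[arm])
        alpha[arm] += r
        beta[arm] += 1.0 - r
        return r

    def incentive_cost(arm):
        # To induce a pull of `arm` instead of the myopic choice, the
        # principal must offer a payment covering the gap in posterior
        # means -- the cost the paper trades off against realized reward.
        means = alpha / (alpha + beta)
        return max(0.0, float(means[myopic_choice()] - means[arm]))
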
We explore the tradeoff between the principal's total expected time-discounted incentive payments, and the total time-discounted rewards realized. Specifically, with a time-discount factor γ ∈ (0,1), let OPT denote the total expected time-discounted reward achievable by a principal who pulls arms directly in a MAB problem, without having to incentivize selfish agents. We call a pair (ρ,b) ∈ [0,1]² consisting of a reward ρ and payment b achievable if for every MAB instance, using expected time-discounted payments of at most b•OPT, the principal can guarantee an expected time-discounted reward of at least ρ•OPT. Our main result is an essentially complete characterization of achievable (payment, reward) pairs: if √b + √(1-ρ) > √γ, then (ρ,b) is achievable, and if √b + √(1-ρ) < √γ, then (ρ,b) is not achievable.
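
For intuition, the frontier is easy to evaluate numerically. A small sketch (the function name is ours; the boundary case √b + √(1-ρ) = √γ is the gap left open by the characterization):

    import math

    def achievable(rho, b, gamma):
        # (rho, b) is achievable if sqrt(b) + sqrt(1 - rho) > sqrt(gamma),
        # and not achievable if the inequality is strictly reversed.
        return math.sqrt(b) + math.sqrt(1.0 - rho) > math.sqrt(gamma)

    # Worked example with gamma = 0.9: a payment budget of b = 0.1 * OPT
    # permits any reward fraction rho with
    # sqrt(1 - rho) > sqrt(0.9) - sqrt(0.1) ~ 0.6325, i.e. rho < ~0.60.
    print(achievable(0.59, 0.1, 0.9))  # True
    print(achievable(0.65, 0.1, 0.9))  # False
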
In proving this characterization, we analyze so-called time-expanded policies, which in each step let the agents choose myopically with some probability p, and incentivize them to choose "optimally" with probability 1-p. The analysis of time-expanded policies leads to a question that may be of independent interest: if the same MAB instance (without selfish agents) is considered under two different time-discount factors γ > η, how small can the ratio of OPTη to OPTγ be? We give a complete answer to this question, showing that OPTη ≥ (1-γ)²/(1-η)² • OPTγ, and that this bound is tight.
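
Both ingredients admit a short sketch. Below, time_expanded_step is schematic (the paper specifies how the "optimal" arm and the accompanying payments are chosen), and discount_ratio_lower_bound evaluates the tight bound relating OPT under the two discount factors; the function names are ours.

    import random

    def time_expanded_step(p, myopic_arm, optimal_arm):
        # One step of a time-expanded policy: with probability p the
        # agents choose myopically; with probability 1 - p the principal
        # pays them enough to pull the "optimal" arm instead.
        return myopic_arm if random.random() < p else optimal_arm

    def discount_ratio_lower_bound(gamma, eta):
        # For discount factors gamma > eta, the paper shows
        #   OPT_eta >= (1 - gamma)^2 / (1 - eta)^2 * OPT_gamma,
        # and that this bound is tight.
        assert 0.0 < eta < gamma < 1.0
        return (1.0 - gamma) ** 2 / (1.0 - eta) ** 2

    # E.g., lowering the discount factor from gamma = 0.9 to eta = 0.8
    # preserves at least (0.1 / 0.2)^2 = 25% of the optimal reward.
    print(discount_ratio_lower_bound(0.9, 0.8))  # 0.25
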

Published In

EC '14: Proceedings of the fifteenth ACM conference on Economics and computation
June 2014
1028 pages
ISBN:9781450325653
DOI:10.1145/2600057

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. crowdsourcing
  2. exploration
  3. incentives
  4. multi-armed bandit problems

Qualifiers

  • Research-article

Conference

EC '14: ACM Conference on Economics and Computation
June 8-12, 2014
Palo Alto, California, USA

Acceptance Rates

EC '14 Paper Acceptance Rate: 80 of 290 submissions, 28%
Overall Acceptance Rate: 664 of 2,389 submissions, 28%

Cited By

  • (2024) Incentivized Exploration via Filtered Posterior Sampling. SSRN Electronic Journal. DOI: 10.2139/ssrn.4733191. Online publication date: 2024.
  • (2024) Clickbait vs. Quality: How Engagement-Based Optimization Shapes the Content Landscape in Online Platforms. Proceedings of the ACM Web Conference 2024, 36-45. DOI: 10.1145/3589334.3645353. Online publication date: 13-May-2024.
  • (2023) Bandit social learning under myopic behavior. Proceedings of the 37th International Conference on Neural Information Processing Systems, 10385-10411. DOI: 10.5555/3666122.3666578. Online publication date: 10-Dec-2023.
  • (2023) Incentivizing exploration with linear contexts and combinatorial actions. Proceedings of the 40th International Conference on Machine Learning, 30570-30583. DOI: 10.5555/3618408.3619675. Online publication date: 23-Jul-2023.
  • (2023) Sample Complexity of Decentralized Tabular Q-Learning for Stochastic Games. 2023 American Control Conference (ACC), 1098-1103. DOI: 10.23919/ACC55779.2023.10155822. Online publication date: 31-May-2023.
  • (2023) Incentivizing Exploration in Linear Contextual Bandits under Information Gap. Proceedings of the 17th ACM Conference on Recommender Systems, 415-425. DOI: 10.1145/3604915.3608794. Online publication date: 14-Sep-2023.
  • (2023) Learning Equilibria in Matching Markets with Bandit Feedback. Journal of the ACM 70, 3, 1-46. DOI: 10.1145/3583681. Online publication date: 24-May-2023.
  • (2023) Reward Teaching for Federated Multiarmed Bandits. IEEE Transactions on Signal Processing 71, 4407-4422. DOI: 10.1109/TSP.2023.3333658. Online publication date: 2023.
  • (2023) Budget-Constrained Cost-Covering Job Assignment for a Total Contribution-Maximizing Platform. Combinatorial Algorithms, 392-403. DOI: 10.1007/978-3-031-34347-6_33. Online publication date: 3-Jun-2023.
  • (2023) The Societal Impacts of Algorithmic Decision-Making. Online publication date: 7-Sep-2023.
