ACM Home Page
Please provide us with feedback. Feedback
Relating reinforcement learning performance to classification performance
Full text PdfPdf (887 KB)
Source ACM International Conference Proceeding Series; Vol. 119 archive
Proceedings of the 22nd international conference on Machine learning table of contents
Bonn, Germany
Pages: 473 - 480  
Year of Publication: 2005
ISBN:1-59593-180-5
Authors
John Langford  TTI-Chicago, Chicago, IL
Bianca Zadrozny  IBM T.J. Watson, Yorktown Heights, NY
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 5,   Downloads (12 Months): 27,   Citation Count: 0
Additional Information:

abstract   references   collaborative colleagues  

Tools and Actions: Review this Article  
Save this Article to a Binder    Display Formats: BibTex  EndNote ACM Ref   
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1102351.1102411
What is a DOI?

ABSTRACT

We prove a quantitative connection between the expected sum of rewards of a policy and binary classification performance on created subproblems. This connection holds without any unobservable assumptions (no assumption of independence, small mixing time, fully observable states, or even hidden states) and the resulting statement is independent of the number of states or actions. The statement is critically dependent on the size of the rewards and prediction performance of the created classifiers.We also provide some general guidelines for obtaining good classification performance on the created subproblems. In particular, we discuss possible methods for generating training examples for a classifier learning algorithm.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Bagnell, D. (2004). Personal communication.
 
2
Bagnell, D., Kakade, S., Ng, A., & Schneider, J. (2004). Policy search by dynamic programming. Advances in Neural Information Processing Systems 16 (NIPS*2003).
 
3
Baird, L. C. (1993). Advantage updating (Technical Report). Wright Laboratory.
 
4
 
5
Beygelzimer, A., Dani, V., Hayes, T., Langford, J., & Zadrozny, B. (2004). Reductions between classification tasks. Preprint at http://hunch.net/~jl/projects/reductions/mc_to_c/mc_submission.ps.
 
6
Evan-Dar, E., Kakade, S., & Mansour, Y. (2005). Reinforcement learning in POMDPs without resets. Preprint at http://www.cis.upenn.edu/~skakade/papers/rl/learn_pomdp.ps.
 
7
Fern, A., Yoon, S., & Givan, R. (2004). Approximate policy iteration with a policy language bias. Advances in Neural Information Processing Systems 16 (NIPS*2003).
 
8
 
9
 
10
Kearns, M., Ng, A., & Mansour, Y. (2000). Approximate planning in large pomdps via reusable trajectories. Advances in Neural Information Processing Systems 12 (NIPS*1999).
 
11
 
12
 
13
Lagoudakis, M., & Parr, R. (2003). Reinforcement learning as classification: Leveraging modern classifiers. Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003) (pp. 424--431).
 
14
Langford, J., & Beygelzimer, A. (2005). Sensitive error correcting output codes. Proceedings of the Eighteenth Annual Conference on Learning Theory (COLT-2005). To appear.
 
15
Langford, J., & Zadrozny, B. (2003). Reducing T-step reinforcement learning to classification. Preprint at http://hunch.net/~jl/projects/reductions/RL_to_class/colt_submission.ps%.
 
16
Langford, J., & Zadrozny, B. (2005). Estimating class membership probabilities using classifier learners. Proceedings of the Tenth International Workshop on AI and Statistics (AISTATS-2005) (pp. 198--205).
 
17
 
18
 
19
Watkins, C. (1989). Learning from delayed rewards. Doctoral dissertation, Cambridge University.
 
20
Collaborative Colleagues:
John Langford: colleagues
Bianca Zadrozny: colleagues