|
ABSTRACT
We prove a quantitative connection between the expected sum of rewards of a policy and binary classification performance on created subproblems. This connection holds without any unobservable assumptions (no assumption of independence, small mixing time, fully observable states, or even hidden states) and the resulting statement is independent of the number of states or actions. The statement is critically dependent on the size of the rewards and prediction performance of the created classifiers.We also provide some general guidelines for obtaining good classification performance on the created subproblems. In particular, we discuss possible methods for generating training examples for a classifier learning algorithm.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Bagnell, D. (2004). Personal communication.
|
| |
2
|
Bagnell, D., Kakade, S., Ng, A., & Schneider, J. (2004). Policy search by dynamic programming. Advances in Neural Information Processing Systems 16 (NIPS*2003).
|
| |
3
|
Baird, L. C. (1993). Advantage updating (Technical Report). Wright Laboratory.
|
| |
4
|
|
| |
5
|
Beygelzimer, A., Dani, V., Hayes, T., Langford, J., & Zadrozny, B. (2004). Reductions between classification tasks. Preprint at http://hunch.net/~jl/projects/reductions/mc_to_c/mc_submission.ps.
|
| |
6
|
Evan-Dar, E., Kakade, S., & Mansour, Y. (2005). Reinforcement learning in POMDPs without resets. Preprint at http://www.cis.upenn.edu/~skakade/papers/rl/learn_pomdp.ps.
|
| |
7
|
Fern, A., Yoon, S., & Givan, R. (2004). Approximate policy iteration with a policy language bias. Advances in Neural Information Processing Systems 16 (NIPS*2003).
|
| |
8
|
|
| |
9
|
|
| |
10
|
Kearns, M., Ng, A., & Mansour, Y. (2000). Approximate planning in large pomdps via reusable trajectories. Advances in Neural Information Processing Systems 12 (NIPS*1999).
|
| |
11
|
|
| |
12
|
|
| |
13
|
Lagoudakis, M., & Parr, R. (2003). Reinforcement learning as classification: Leveraging modern classifiers. Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003) (pp. 424--431).
|
| |
14
|
Langford, J., & Beygelzimer, A. (2005). Sensitive error correcting output codes. Proceedings of the Eighteenth Annual Conference on Learning Theory (COLT-2005). To appear.
|
| |
15
|
Langford, J., & Zadrozny, B. (2003). Reducing T-step reinforcement learning to classification. Preprint at http://hunch.net/~jl/projects/reductions/RL_to_class/colt_submission.ps%.
|
| |
16
|
Langford, J., & Zadrozny, B. (2005). Estimating class membership probabilities using classifier learners. Proceedings of the Tenth International Workshop on AI and Statistics (AISTATS-2005) (pp. 198--205).
|
| |
17
|
|
| |
18
|
|
| |
19
|
Watkins, C. (1989). Learning from delayed rewards. Doctoral dissertation, Cambridge University.
|
| |
20
|
|
|