Abstract
The paper investigates the possibility of applying value-function-based reinforcement learning (RL) methods in cases where the environment may change over time. First, theorems are presented which show that the optimal value function of a discounted Markov decision process (MDP) depends Lipschitz continuously on the immediate-cost function and on the transition-probability function. Dependence on the discount factor is also analyzed and shown to be non-Lipschitz. Afterwards, the concept of (ε,δ)-MDPs is introduced as a generalization of MDPs and ε-MDPs. In this model the environment may change over time; more precisely, the transition function and the cost function may vary from time to time, provided the changes remain bounded in the limit. Learning algorithms in such changing environments are then analyzed, and a general relaxed convergence theorem for stochastic iterative algorithms is presented. The results are also demonstrated through three classical RL methods: asynchronous value iteration, Q-learning, and temporal difference learning. Finally, numerical experiments concerning changing environments are presented.
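To make the setting concrete, below is a minimal, self-contained sketch (not taken from the paper and not its experimental setup) of tabular Q-learning run against a small MDP whose transition kernel drifts slowly over time, loosely in the spirit of an (ε,δ)-MDP. The state and action counts, the cost function, the drift model, and all numeric constants are illustrative assumptions.

```python
import numpy as np

# Sketch: tabular Q-learning (cost-minimization form) in a small MDP whose
# transition probabilities are slightly perturbed at every step.
# All constants and the drift model are illustrative assumptions.

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
gamma = 0.9       # discount factor
alpha = 0.1       # learning rate
epsilon = 0.1     # exploration rate
drift = 1e-4      # per-step perturbation of the transition kernel

# Random initial transition kernel P[s, a] -> distribution over next states,
# and a fixed immediate-cost function.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
cost = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

Q = np.zeros((n_states, n_actions))
s = 0
for t in range(100_000):
    # epsilon-greedy action selection (minimizing expected discounted cost)
    if rng.random() < epsilon:
        a = int(rng.integers(n_actions))
    else:
        a = int(np.argmin(Q[s]))

    s_next = rng.choice(n_states, p=P[s, a])

    # Standard Q-learning update, written for costs (min instead of max).
    td_target = cost[s, a] + gamma * Q[s_next].min()
    Q[s, a] += alpha * (td_target - Q[s, a])

    # Slowly perturb the environment: mix each row of the kernel with a
    # fresh random distribution, so rows stay stochastic and the per-step
    # change is small (echoing the "bounded in the limit" requirement).
    noise = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    P = (1.0 - drift) * P + drift * noise

    s = s_next

print(np.round(Q, 3))
```

Because the per-step perturbation is small and bounded, the learned Q-values can be expected to track a neighborhood of the optimal values rather than converge exactly, which is the relaxed-convergence flavor of the results summarized above.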