Probabilistic Policy Reuse for Safe Reinforcement Learning

Published: 15 March 2019

Abstract

This work introduces Policy Reuse for Safe Reinforcement Learning, an algorithm that combines Probabilistic Policy Reuse and teacher advice to achieve safe exploration in dangerous reinforcement learning problems with continuous state and action spaces, under the assumption that the dynamics are reasonably smooth and the state space is Euclidean. The algorithm uses a continuous, monotonically increasing risk function that estimates the probability of ending in failure from a given state. This risk function is defined in terms of how far the state lies from the region of the state space already known to the learning agent. Probabilistic Policy Reuse is used to safely balance the exploitation of the knowledge learned so far, the exploration of new actions, and requests for teacher advice in parts of the state space considered dangerous; specifically, the π-reuse exploration strategy is used. Through experiments on the helicopter hover task and a business management problem, we show that the π-reuse exploration strategy can completely avoid visits to undesirable situations while maintaining the performance (in terms of the classical long-term accumulated reward) of the final policy.
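To make the two ingredients of the abstract concrete, the sketch below (a minimal Python illustration, not the authors' implementation) pairs a distance-based risk estimate with a π-reuse-style action selector. The names risk, pi_reuse_action, teacher, q_policy, and sample_action, as well as the threshold and probability values, are assumptions introduced here for illustration only.

import numpy as np

# Minimal sketch (not the paper's implementation) of the two ideas in the
# abstract: (1) a monotonically increasing risk function based on the
# distance from a state to the set of already known states, and (2) a
# pi-reuse-style action selection that mixes teacher advice, reuse of the
# teacher's policy, exploitation of the learned policy, and random
# exploration. All names and thresholds below are illustrative assumptions.

def risk(state, known_states, tau=1.0):
    """Risk grows monotonically with the distance to the closest known state."""
    if len(known_states) == 0:
        return 1.0                                   # nothing known yet: maximal risk
    d_min = min(np.linalg.norm(np.asarray(state) - np.asarray(s)) for s in known_states)
    return 1.0 - np.exp(-d_min / tau)                # value in [0, 1), increasing in d_min

def pi_reuse_action(state, known_states, teacher, q_policy, sample_action,
                    psi=0.9, epsilon=0.1, risk_threshold=0.5, rng=np.random):
    """Ask the teacher in risky states; otherwise reuse the teacher's policy
    with probability psi, and act epsilon-greedily on the current learned
    policy with probability 1 - psi."""
    if risk(state, known_states) > risk_threshold:
        return teacher(state)                        # dangerous region: rely on teacher advice
    if rng.random() < psi:
        return teacher(state)                        # probabilistic reuse of the known-good policy
    if rng.random() < epsilon:
        return sample_action()                       # explore a new action
    return q_policy(state)                           # exploit the knowledge learned so far

In the π-reuse strategy of Fernández and Veloso, the reuse probability ψ typically decays within an episode, so the influence of the reused (teacher) policy fades as the agent's own policy improves.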



        Reviews

        Giuseppina Carla Gini

Human and robot planners seek safe and optimal action plans. Learning to adapt good examples, such as those provided by a teacher, is an effective way to speed up action planning and can be used to bootstrap reinforcement learning. When learning is performed continuously during real actions in a stochastic domain, the same action can lead the agent to different situations; repeating a plan therefore does not work, and exploration is necessary, with the attendant risk of taking the agent into unseen and dangerous situations. Learning during execution and optimizing the plan are thus conflicting tasks. This paper proposes an algorithm to control how the two processes interplay. A continuous, monotonically increasing risk function, integrated into the policy reuse strategy, is parameterized to favor either safety (avoiding unknown situations) or optimization (allowing more exploration). The results, discussed for a helicopter problem and for a competition among agents, show how parameter tuning can effectively shape the learning. The paper is well written and easy to follow, the motivations and the resulting algorithm are analyzed, and the proposed solution is appealing. I would like to see more examples, especially from real life, where sensor readings add uncertainty and perhaps more tuning is necessary.


        • Published in

          ACM Transactions on Autonomous and Adaptive Systems, Volume 13, Issue 3
          September 2018
          98 pages
          ISSN: 1556-4665
          EISSN: 1556-4703
          DOI: 10.1145/3320018

          Copyright © 2019 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 15 March 2019
          • Revised: 1 January 2019
          • Accepted: 1 January 2019
          • Received: 1 July 2017
          Published in TAAS Volume 13, Issue 3


          Qualifiers

          • research-article
          • Research
          • Refereed
