Probabilistic Policy Reuse for Safe Reinforcement Learning

Published: 15 March 2019

Abstract

This work introduces Policy Reuse for Safe Reinforcement Learning, an algorithm that combines Probabilistic Policy Reuse and teacher advice to achieve safe exploration in dangerous reinforcement learning problems with continuous state and action spaces, under the assumption that the dynamics are reasonably smooth and the state space is Euclidean. The algorithm uses a continuous, monotonically increasing risk function that estimates the probability of ending in failure from a given state. This risk function is defined in terms of how far the state lies from the region of the state space already known to the learning agent. Probabilistic Policy Reuse is used to safely balance the exploitation of the knowledge learned so far, the exploration of new actions, and requests for teacher advice in parts of the state space considered dangerous; specifically, the π-reuse exploration strategy is used. Through experiments on the helicopter hover task and a business management problem, we show that the π-reuse exploration strategy can completely avoid visits to undesirable situations while maintaining the performance (in terms of the classical long-term accumulated reward) of the final policy.
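To make the two ingredients of the abstract concrete, the sketch below (a minimal Python illustration, not the authors' implementation) pairs a distance-based risk estimate with a π-reuse-style action selector. The names risk, pi_reuse_action, teacher, q_policy, and sample_action, as well as the threshold and probability values, are assumptions introduced here for illustration only.

import numpy as np

# Minimal sketch (not the paper's implementation) of the two ideas in the
# abstract: (1) a monotonically increasing risk function based on the
# distance from a state to the set of already known states, and (2) a
# pi-reuse-style action selection that mixes teacher advice, reuse of the
# teacher's policy, exploitation of the learned policy, and random
# exploration. All names and thresholds below are illustrative assumptions.

def risk(state, known_states, tau=1.0):
    """Risk grows monotonically with the distance to the closest known state."""
    if len(known_states) == 0:
        return 1.0                                   # nothing known yet: maximal risk
    d_min = min(np.linalg.norm(np.asarray(state) - np.asarray(s)) for s in known_states)
    return 1.0 - np.exp(-d_min / tau)                # value in [0, 1), increasing in d_min

def pi_reuse_action(state, known_states, teacher, q_policy, sample_action,
                    psi=0.9, epsilon=0.1, risk_threshold=0.5, rng=np.random):
    """Ask the teacher in risky states; otherwise reuse the teacher's policy
    with probability psi, and act epsilon-greedily on the current learned
    policy with probability 1 - psi."""
    if risk(state, known_states) > risk_threshold:
        return teacher(state)                        # dangerous region: rely on teacher advice
    if rng.random() < psi:
        return teacher(state)                        # probabilistic reuse of the known-good policy
    if rng.random() < epsilon:
        return sample_action()                       # explore a new action
    return q_policy(state)                           # exploit the knowledge learned so far

In the π-reuse strategy of Fernández and Veloso, the reuse probability ψ typically decays within an episode, so the influence of the reused (teacher) policy fades as the agent's own policy improves.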



        Reviews

        Giuseppina Carla Gini

Human and robot planners seek safe and optimal action plans. Learning to adapt good examples, such as those provided by a teacher, is an effective way to speed up action planning and can be used to bootstrap reinforcement learning. When learning is performed continuously during real actions in a stochastic domain, the same action can lead the agent to different situations; repeating a plan therefore does not work, and exploration is necessary, with the attendant risk of taking the agent into unseen and dangerous situations. Learning during execution and optimizing the plan are thus conflicting tasks. This paper proposes an algorithm to control how the two processes interplay. A continuous, monotonically increasing risk function, integrated into the policy reuse strategy, is parameterized to favor either safety (avoiding unknown situations) or optimization (allowing more exploration). The results, discussed for a helicopter problem and for a competition among agents, show how parameter tuning can effectively shape the learning. The paper is well written and easy to follow, the motivations and the resulting algorithm are analyzed, and the proposed solution is appealing. I would like to see more examples, especially from real life, where sensor readings add uncertainty and perhaps more tuning is necessary.


        • Published in

          ACM Transactions on Autonomous and Adaptive Systems, Volume 13, Issue 3
          September 2018
          98 pages
          ISSN: 1556-4665
          EISSN: 1556-4703
          DOI: 10.1145/3320018

          Copyright © 2019 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 15 March 2019
          • Revised: 1 January 2019
          • Accepted: 1 January 2019
          • Received: 1 July 2017
          Published in TAAS Volume 13, Issue 3


          Qualifiers

          • research-article
          • Research
          • Refereed
