Abstract
This work introduces Policy Reuse for Safe Reinforcement Learning, an algorithm that combines Probabilistic Policy Reuse and teacher advice for safe exploration in dangerous, continuous state and action reinforcement learning problems whose dynamics are reasonably smooth and whose state space is Euclidean. The algorithm uses a continuous, monotonically increasing risk function that estimates the probability of ending in failure from a given state. This risk function is defined in terms of how far the state lies from the region of the state space already known to the learning agent. Probabilistic Policy Reuse is used to safely balance the exploitation of the knowledge learned so far, the exploration of new actions, and requests for teacher advice in parts of the state space considered dangerous. Specifically, the π-reuse exploration strategy is used. Through experiments on the helicopter hover task and a business management problem, we show that the π-reuse exploration strategy can completely avoid visits to undesirable situations while maintaining the performance (in terms of the classical long-term accumulated reward) of the final policy.
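To make the idea concrete, the following is a minimal sketch of how a distance-based risk function and a π-reuse action-selection step could look. All names, the exponential risk shape, and the thresholds (`tau`, `psi`, `epsilon`, `risk_threshold`) are illustrative assumptions, not the authors' implementation.

```python
import math
import random

def risk(state, known_states, tau=1.0):
    """Monotonically increasing risk in the Euclidean distance from the
    nearest state the agent has already visited (assumed risk shape)."""
    d = min(math.dist(state, s) for s in known_states)
    return 1.0 - math.exp(-d / tau)  # in [0, 1), grows with distance

def pi_reuse_action(state, q_policy, teacher, known_states,
                    psi=0.9, epsilon=0.1, risk_threshold=0.5):
    """One pi-reuse step: in risky states request teacher advice;
    otherwise reuse the past (teacher) policy with probability psi,
    and act epsilon-greedily on the learned policy with 1 - psi."""
    if risk(state, known_states) > risk_threshold:
        return teacher(state)            # dangerous region: ask the teacher
    if random.random() < psi:
        return teacher(state)            # reuse the past policy
    if random.random() < epsilon:
        return q_policy.random_action()  # explore a new action
    return q_policy.greedy_action(state) # exploit learned knowledge
```

In this sketch, `q_policy` and `teacher` stand in for the learned value-based policy and the advice source; in the paper, `psi` additionally decays over the episode, which is omitted here for brevity.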