The methods presented below integrate model-based reinforcement learning (RL) and active learning, with the objective of minimizing both the number of action executions and the number of teacher demonstration requests. These approaches learn rule models that can be used by planners.

The code is available on Bitbucket.

Documentation for the code is available here.

V-MIN extends REX [4] by including a teacher in the loop to reduce the number of actions required to learn.

As a result, V-MIN can learn models even when exploration is very scarce. If an important state-action pair has not been visited, teacher demonstrations are requested until the agent learns a model that can obtain values larger than Vmin.
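The request loop described above can be illustrated with a minimal sketch. It is a toy, not the released V-MIN code: the planner is replaced by a hypothetical `plan_value` function, the teacher by a hypothetical `demonstrate` function, the model is reduced to a set of known action names, and exploration is left out entirely. Only the stopping condition (keep asking until the planned value is larger than Vmin) comes from the text.

```python
from typing import Callable, Set, Tuple


def learn_with_teacher(
    plan_value: Callable[[Set[str]], float],
    demonstrate: Callable[[Set[str]], str],
    v_min: float,
    max_requests: int = 10,
) -> Tuple[Set[str], int]:
    """Request demonstrations until the planned value is larger than v_min."""
    known: Set[str] = set()  # stand-in for the learned rule model
    requests = 0
    while plan_value(known) <= v_min and requests < max_requests:
        # The current model cannot reach a good enough value: ask the teacher.
        known.add(demonstrate(known))
        requests += 1
    return known, requests


# Hypothetical toy task that needs three actions to be solved completely.
REQUIRED = ["pick", "place", "push"]


def toy_plan_value(known: Set[str]) -> float:
    # Assumed value function: fraction of the required actions already known.
    return sum(a in known for a in REQUIRED) / len(REQUIRED)


def toy_teacher(known: Set[str]) -> str:
    # Assumed teacher: demonstrates the first required action still unknown.
    return next(a for a in REQUIRED if a not in known)


known_actions, n_requests = learn_with_teacher(toy_plan_value, toy_teacher, v_min=0.5)
print(sorted(known_actions), n_requests)  # ['pick', 'place'] 2
```

With `v_min=0.5` the loop stops after two demonstrations, because two of the three actions already yield a value above the threshold. This mirrors the idea that a relaxed reward demand reduces the number of teacher requests.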

V-MIN has the following features:

Video comparing REX and V-MIN

The video below compares REX with V-MIN in the AUTAS scenario. The video is slowed down during the first episodes to show the teacher demonstration requests and the exploration in more detail.

The algorithms solve three problems with 5 episodes each. AUTAS 1 is the standard one, while AUTAS 2 and AUTAS 3 show new unexpected cases where previously unknown actions are required (and thus a new demonstration is required in V-MIN).

Video legend: Demonstration Request, Exploration, Exploitation, Rules

REX (RL without demonstrations)

V-MIN (RL + active learning)

In the work presented above, the agent actively requests demonstrations from a teacher. However, the teacher may not know which parts of the model are unknown. Here we analyze the model to determine which parts cause the planner to find bad solutions, and use these causes to guide the teacher.

To explain planning failures, Göbelbecker et al. [5] designed a method to find excuses, which are changes to the state that allow the planner to find a solution. Based on these excuses, we can give the teacher feedback about wrong preconditions or unknown but needed effects that may be causing the planner to fail.
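The notion of an excuse can be sketched with a brute-force search over a toy STRIPS-like domain. This is not the algorithm of Göbelbecker et al. [5]; it only illustrates the definition, and it considers only excuses that add facts to the state. The action encoding, the fact names, and the helpers `solvable` and `find_excuse` are all hypothetical.

```python
from collections import deque
from itertools import combinations

# Each action: (name, preconditions, add effects, delete effects).
ACTIONS = [
    ("placeShaft", frozenset({"shaft_vertical"}), frozenset({"shaft_placed"}), frozenset()),
]


def solvable(state, goal, actions):
    """Breadth-first search: is some state containing the goal reachable?"""
    seen = {state}
    queue = deque([state])
    while queue:
        current = queue.popleft()
        if goal <= current:
            return True
        for _name, pre, add, delete in actions:
            if pre <= current:
                successor = (current - delete) | add
                if successor not in seen:
                    seen.add(successor)
                    queue.append(successor)
    return False


def find_excuse(state, goal, actions, candidate_facts):
    """Return a smallest set of added facts that makes the task solvable."""
    for size in range(len(candidate_facts) + 1):
        for extra in combinations(sorted(candidate_facts), size):
            if solvable(state | frozenset(extra), goal, actions):
                return set(extra)
    return None  # no combination of candidate facts helps


initial = frozenset({"shaft_horizontal"})
goal = frozenset({"shaft_placed"})
# Goal facts are deliberately excluded from the candidates to avoid trivial excuses.
excuse = find_excuse(initial, goal, ACTIONS, {"shaft_vertical", "gripper_free"})
print(excuse)  # {'shaft_vertical'}
```

The returned excuse points at the precondition the planner could not satisfy, which is the kind of information that can be reported to the teacher.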


An example is shown in the video below. Here the robot has to learn a new action that repositions a shaft vertically. When the planner cannot find a solution, the robot tells the teacher that the placeShaft action requires the shaft to be in a non-horizontal position, but that this position cannot be obtained.

In real-world domains, there are usually sequences of actions that, if executed, may produce unrecoverable errors (e.g. breaking an object). Robots should avoid repeating such errors when learning, and thus explore the state space in a more intelligent way. Robots should reason about dead-ends and their causes, and once dangerous actions are identified, the RL algorithm can avoid them.

We show this in a tableware clearing task.
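One simple way to picture the avoidance of dangerous actions is a filter that remembers which action led to a dead-end under which conditions. The sketch below assumes that a cause has already been identified as a set of facts; the class `DeadEndFilter`, its methods, and the tableware fact names are hypothetical and do not come from the method in [3].

```python
from typing import FrozenSet, List, Set, Tuple


class DeadEndFilter:
    """Remembers (cause, action) pairs that previously led to a dead-end."""

    def __init__(self) -> None:
        self._dangerous: Set[Tuple[FrozenSet[str], str]] = set()

    def record_dead_end(self, cause: FrozenSet[str], action: str) -> None:
        # 'cause' is the set of facts assumed to make 'action' unrecoverable.
        self._dangerous.add((cause, action))

    def safe_actions(self, state: FrozenSet[str], actions: List[str]) -> List[str]:
        # Drop every action whose recorded cause holds in the current state.
        return [
            action
            for action in actions
            if not any(
                bad_action == action and cause <= state
                for cause, bad_action in self._dangerous
            )
        ]


dead_ends = DeadEndFilter()
# Assumed experience: lifting a plate with a cup on top broke the cup.
dead_ends.record_dead_end(frozenset({"cup_on_plate"}), "lift_plate")

risky_state = frozenset({"cup_on_plate", "plate_on_table"})
clear_state = frozenset({"plate_on_table"})
candidates = ["lift_plate", "lift_cup"]

allowed_risky = dead_ends.safe_actions(risky_state, candidates)
allowed_clear = dead_ends.safe_actions(clear_state, candidates)
print(allowed_risky)  # ['lift_cup']
print(allowed_clear)  # ['lift_plate', 'lift_cup']
```

An exploration strategy could call `safe_actions` before choosing what to try next, so the same unrecoverable error is not repeated while the action remains available in states where it is harmless.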

[1] V-MIN: Efficient reinforcement learning through demonstrations and relaxed reward demands
D. Martínez, G. Alenyà, and C. Torras
Proceedings of the AAAI Conference on Artificial Intelligence, 2015, pp. 2857–2863


[2] Relational reinforcement learning with guided demonstrations
D. Martínez, G. Alenyà, and C. Torras
Artificial Intelligence, 2017, 247, pp. 295–312


[3] Safe robot execution in model-based reinforcement learning
D. Martínez, G. Alenyà, and C. Torras
IEEE/RSJ International Conference on Intelligent Robots and Systems, 2015, pp. 6422–6427


[4] Exploration in relational domains for model-based reinforcement learning
T. Lang, M. Toussaint, and K. Kersting
The Journal of Machine Learning Research, 2012, 13(1), pp. 3725–3768

[5] Coming Up With Good Excuses: What to do When no Plan Can be Found
M. Göbelbecker, T. Keller, P. Eyerich, M. Brenner, and B. Nebel
International Conference on Automated Planning and Scheduling, 2010, pp. 81–88