We propose a low-cost, easily realizable strategy to equip a reinforcement learning (RL) agent with the capability of behaving ethically. Our model allows the designers of RL agents to focus solely on the task to be achieved, without having to implement the many small ethical patterns the agent should follow. Based on the assumption that the majority of human behavior, regardless of the goal being pursued, is ethical, our design integrates a human policy with the RL policy so that the agent achieves the target objective with less chance of violating the ethical code that human beings normally obey. Please refer to the paper for more details. If you find this work useful in your research, please cite:
@article{wu2017low,
title={A Low-Cost Ethics Shaping Approach for Designing Reinforcement Learning Agents},
author={Wu, Yueh-Hua and Lin, Shou-De},
journal={arXiv preprint arXiv:1712.04172},
year={2017}
}
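The core idea, in a nutshell, is to shape the task reward with a term that compares the agent's action against a recorded human policy. Below is a minimal sketch of that idea; the policy file format (a state-to-action dict), the bonus/penalty constants, and the function names are assumptions for illustration, not the repository's actual implementation (see the paper for the exact shaping term).

```python
import pickle

# Sketch of ethics shaping (illustrative, not the repo's exact code):
# the task reward is augmented with a bonus when the agent's action
# matches what the recorded human policy would do, and a penalty when
# it deviates.  The file name and constants below are placeholders.
with open('hpolicy_milk.pkl', 'rb') as f:
    human_policy = pickle.load(f)  # assumed: dict mapping state -> action

ETHICS_BONUS = 0.1     # added when the agent imitates the human
ETHICS_PENALTY = -0.1  # added when it deviates

def shaped_reward(state, action, task_reward):
    """Task reward plus an ethics-shaping term from the human policy."""
    if state not in human_policy:
        return task_reward  # no human data for this state
    if action == human_policy[state]:
        return task_reward + ETHICS_BONUS
    return task_reward + ETHICS_PENALTY
```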
Packages used:
- numpy
- pandas
Python version: 3.5.2 or later
For the detailed settings of the experiments, please refer to the Experiments section of the paper. Follow the instructions below to reproduce the experiment results; the results are saved in the record folder.
For the Grab a Milk experiment, to see the performance without human trajectories,
cd ./Milk/
python sarsa.py
To see the performance with human trajectories, please make sure the hpolicy_milk.pkl file exists. If it does not, generate the human trajectories with
python human_policy.py
python sarsa.py --ethical
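For convenience, the two steps above can be combined into a small driver script. A sketch, assuming the file and script names listed in this README:

```python
import os.path
import subprocess

# Regenerate the human trajectories only when the pickle is missing,
# then run the ethics-shaped agent (mirrors the manual steps above).
if not os.path.exists('hpolicy_milk.pkl'):
    subprocess.run(['python', 'human_policy.py'], check=True)
subprocess.run(['python', 'sarsa.py', '--ethical'], check=True)
```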
There are two driving experiments, Driving and Avoiding and Driving and Rescuing. In both, there are cars and cats in five lanes. In the former, the agent should avoid the cats; in the latter, the agent can save the cats from danger.
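To make the setup concrete, here is a toy illustration of such a five-lane road state; the cell encoding and sizes are made up and are not the repository's representation:

```python
import numpy as np

# Toy five-lane road (illustrative only): each cell ahead of the agent
# may be empty or hold a car or a cat.
EMPTY, CAR, CAT = 0, 1, 2
N_LANES, HORIZON = 5, 10  # lanes across, rows ahead of the agent

rng = np.random.default_rng(0)
road = rng.choice([EMPTY, CAR, CAT], size=(HORIZON, N_LANES), p=[0.8, 0.1, 0.1])
agent_lane = 2  # the agent starts in, say, the middle lane
```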
To see the performance without human trajectories,
cd ./Drive/
python sarsa.py
For Driving and Avoiding, to see the performance with human trajectories, please make sure the hpolicy_drive_n.pkl file exists. If it does not, generate the human trajectories with
python hsarsa_n.py
python sarsa.py --n_ethical
Similarly, for Driving and Rescuing, please make sure the hpolicy_drive_p.pkl file exists and use
python hsarsa_p.py
python sarsa.py --p_ethical
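Inside sarsa.py, the two flags presumably just select which human policy file is used for shaping. A hypothetical sketch of that argument handling (the repository's actual code may differ):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--n_ethical', action='store_true',
                    help='shape rewards with hpolicy_drive_n.pkl (Avoiding)')
parser.add_argument('--p_ethical', action='store_true',
                    help='shape rewards with hpolicy_drive_p.pkl (Rescuing)')
args = parser.parse_args()

if args.n_ethical:
    hpolicy_file = 'hpolicy_drive_n.pkl'
elif args.p_ethical:
    hpolicy_file = 'hpolicy_drive_p.pkl'
else:
    hpolicy_file = None  # plain SARSA, no ethics shaping
```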
In the figures, neg_passed indicates the number of babies that get annoyed, so lower is better. Conversely, pos_passed indicates the number of babies that get comforted, so higher is better. Note that the agent trained without human trajectories makes no use of the ethics-related information: its reward function is the same as it would be if that information did not exist.
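To inspect these two curves from the saved results, something like the following works, assuming the record folder holds a CSV-like log with per-episode neg_passed and pos_passed columns (the actual file name and format may differ) and that matplotlib is installed for pandas plotting:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical log file; adjust the path to whatever lands in record/.
log = pd.read_csv('record/milk_ethical.csv')
log[['neg_passed', 'pos_passed']].plot()
plt.xlabel('episode')
plt.ylabel('count')  # lower neg_passed / higher pos_passed is better
plt.show()
```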
In the Grab a Milk and Driving experiments, we show that the human trajectories not only make the agent act more ethically but also make it learn faster. This holds even when the trajectories are imperfect, as the Driving experiment shows.
Since the Driving environment is much more complicated than Grab a Milk, we generate the human trajectories with SARSA using a different reward function. We intentionally weaken the driving skill of the human agent slightly to demonstrate that our agent can learn from imperfect data.
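For reference, a minimal tabular SARSA loop of the kind used to produce these trajectories is sketched below; env, human_reward, and the hyper-parameter values are placeholders rather than the repository's code:

```python
import numpy as np
from collections import defaultdict

ALPHA, GAMMA, EPSILON, N_ACTIONS = 0.1, 0.95, 0.1, 3
Q = defaultdict(lambda: np.zeros(N_ACTIONS))  # tabular action values

def epsilon_greedy(state):
    """Act greedily w.r.t. Q with epsilon-random exploration."""
    if np.random.rand() < EPSILON:
        return np.random.randint(N_ACTIONS)
    return int(np.argmax(Q[state]))

def sarsa_episode(env, human_reward):
    """One SARSA episode driven by the alternative 'human' reward.

    env is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done).
    """
    state = env.reset()
    action = epsilon_greedy(state)
    done = False
    while not done:
        next_state, _, done = env.step(action)
        reward = human_reward(state, action)  # the weakened human reward
        next_action = epsilon_greedy(next_state)
        target = reward + (0 if done else GAMMA * Q[next_state][next_action])
        Q[state][action] += ALPHA * (target - Q[state][action])
        state, action = next_state, next_action
```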
As the figures show, our agent behaves more ethically than the one trained without human trajectories. Note also that, with ethics shaping, the agent actually outperforms the one without ethics shaping on the task itself. We attribute this to the fact that the human experiences are not necessarily related only to ethics; they can also benefit the learning process in general.