A Reinforcement Learning space to test a variety of algorithms with a variety of environments, both with single and multiple agents.

finn1y/RLTrainingEnv

Reinforcement Learning (RL) Training Environment

A training environment for reinforcement learning algorithms built on OpenAI Gym, as first suggested by Brockman et al in [1].

Included Algorithms

Both single-agent and multi-agent reinforcement learning algorithms are included within the learning environment. Any single-agent algorithm can be used as an independent multi-agent learning algorithm with the multi-agent environments.

All algorithms using gradient descent on a neural network use the Adam optimiser, first proposed by Kingma et al in [2]. They also use the Huber loss function in place of the mean squared error where appropriate, as suggested by Huber in [3]. Both are implemented as part of the TensorFlow library.
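
To illustrate why the Huber loss can be preferable to the squared error, here is a minimal plain-Python sketch of both for a single prediction error. The repository relies on TensorFlow's implementation; the function names and the threshold `delta=1.0` below are illustrative assumptions, not taken from the repository's code.

```python
def huber_loss(error: float, delta: float = 1.0) -> float:
    """Huber loss for a single prediction error (target - prediction)."""
    abs_error = abs(error)
    if abs_error <= delta:
        # Quadratic region: identical to half the squared error
        return 0.5 * error ** 2
    # Linear region: large errors grow linearly rather than quadratically
    return delta * (abs_error - 0.5 * delta)


def squared_error(error: float) -> float:
    """Squared error for a single prediction error, for comparison."""
    return error ** 2


for err in (0.5, 1.0, 4.0):
    print(f"error={err}: huber={huber_loss(err):.3f}, squared={squared_error(err):.3f}")
```

For small errors the two losses behave alike, while for large errors the Huber loss grows only linearly, which limits the size of the gradient produced by outlying targets.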

All algorithm implementations also include exploration rate (epsilon) and learning rate (alpha) decay where appropriate; this can be removed by setting the relevant decay rate to 1.
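
A minimal sketch of how such a decay could work, assuming the rate is applied multiplicatively once per episode. The `decay` helper and the `minimum` floor are hypothetical and only serve the illustration.

```python
def decay(value: float, decay_rate: float, minimum: float = 0.0) -> float:
    """Multiply value by decay_rate without dropping below minimum."""
    return max(minimum, value * decay_rate)


# Exploration rate halves after each of three episodes: 1.0 -> 0.5 -> 0.25 -> 0.125
epsilon = 1.0
for episode in range(3):
    epsilon = decay(epsilon, decay_rate=0.5, minimum=0.01)

# A decay rate of 1 leaves the learning rate unchanged, i.e. decay is disabled
alpha = 0.1
alpha_after = decay(alpha, decay_rate=1.0)
```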

Q-learning is implemented based on the algorithm described by Sutton and Barto in [4].
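
The tabular update at the heart of Q-learning can be sketched as follows. This is an illustrative stand-alone version with hypothetical names (`QTable`, `act`, `update`) and default hyperparameters chosen for the example, not the repository's own class.

```python
import random
from collections import defaultdict


class QTable:
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.n_actions = n_actions
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration rate
        # Unseen states start with a value of 0 for every action
        self.q = defaultdict(lambda: [0.0] * n_actions)

    def act(self, state):
        # Explore with probability epsilon, otherwise pick the greedy action
        if random.random() < self.epsilon:
            return random.randrange(self.n_actions)
        values = self.q[state]
        return values.index(max(values))

    def update(self, state, action, reward, next_state, done):
        # Bootstrap from the best next action unless the episode has ended
        best_next = 0.0 if done else max(self.q[next_state])
        td_target = reward + self.gamma * best_next
        self.q[state][action] += self.alpha * (td_target - self.q[state][action])


agent = QTable(n_actions=2, alpha=0.5, gamma=0.9, epsilon=0.0)
agent.update(state=0, action=1, reward=1.0, next_state=1, done=True)
```

States are used as dictionary keys, which is why the method needs a discrete (hashable) state representation.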

Deep Q-network is implemented based on the algorithm described by Mnih et al in [5]. However, it does not use CNNs, as the environments used in this training are not array based (i.e. not an RGB array screen representation).
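
Two ingredients of the DQN algorithm in [5] are experience replay and bootstrapped targets computed from a separate target network. The sketch below shows both in plain Python; `ReplayBuffer`, `td_targets` and the `target_q_values` callable (a stand-in for the target network) are hypothetical names and do not reflect how the repository structures its code.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions, sampled uniformly at random."""

    def __init__(self, capacity):
        # The oldest transitions are discarded once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def td_targets(batch, target_q_values, gamma=0.99):
    """Compute r + gamma * max_a Q_target(s', a) for each transition."""
    targets = []
    for _, _, reward, next_state, done in batch:
        # No bootstrapping from terminal states
        bootstrap = 0.0 if done else max(target_q_values(next_state))
        targets.append(reward + gamma * bootstrap)
    return targets


buffer = ReplayBuffer(capacity=2)
for step in range(3):
    buffer.add(step, 0, 1.0, step + 1, False)
```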

Deep Recurrent Q-Network is implemented based on the alterations to DQN suggested by Hausknecht and Stone in [6]. As with DQN, CNNs are not used. (Note: this algorithm is implemented as DQN with a DRQN flag that changes the first neural network layer.)

Policy Gradient is implemented using the policy gradient equation as derived by Sutton et al in [7] and its counterpart in [8] by Silver et al. The algorithm is similar to the REINFORCE algorithm suggested by Williams in [9].
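
In REINFORCE-style methods each action's log-probability is weighted by the discounted return that followed it. A minimal sketch of those two steps, using hypothetical helper names and plain Python lists in place of tensors:

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step of an episode."""
    returns = []
    running = 0.0
    # Work backwards so each return reuses the one after it
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def reinforce_loss(log_probs, returns):
    """Negative sum of log pi(a_t | s_t) * G_t, to be minimised."""
    return -sum(lp * g for lp, g in zip(log_probs, returns))


episode_returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
```

Minimising this loss by gradient descent corresponds to ascending the policy gradient estimate.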

Advantage Actor Critic is implemented based on one of the actor critic variations suggested by Bhatnagar et al in [10].
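
One common way to estimate the advantage is the one-step temporal-difference error of the critic. The sketch below assumes that estimator. The source does not say which of the variations in [10] is implemented, so the choice and the function names here are illustrative only.

```python
def advantage(reward, value, next_value, done, gamma=0.99):
    """One-step advantage estimate: r + gamma * V(s') - V(s)."""
    # Terminal states contribute no future value
    bootstrap = 0.0 if done else next_value
    return reward + gamma * bootstrap - value


def actor_loss(log_prob, adv):
    """Policy loss for one step: raise log-probability of advantageous actions."""
    return -log_prob * adv


def critic_loss(adv):
    """Value loss for one step: squared one-step TD error."""
    return adv ** 2


adv = advantage(reward=1.0, value=0.5, next_value=2.0, done=False, gamma=0.5)
```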

Deep Deterministic Policy Gradient is implemented based on the algorithm as suggested in [11] by Lillicrap et al.
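
The algorithm in [11] keeps slowly moving target copies of the actor and critic, updated by blending in a small fraction of the online weights after each learning step. A minimal sketch of that soft update on flat lists of weights follows; the helper name and the value of `tau` are assumptions for illustration.

```python
def soft_update(target_weights, online_weights, tau=0.005):
    """Return tau * online + (1 - tau) * target for each pair of weights."""
    return [
        tau * online + (1.0 - tau) * target
        for target, online in zip(target_weights, online_weights)
    ]


# With tau=0.5 the target moves halfway towards the online weights
new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.5)
```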

Multi-Agent Actor Critic (MA Actor Critic)

Multi-Agent Actor Critic is implemented based on the algorithm described by Lowe et al in [12]. As the multi-agent environments are cooperative, agents communicate their policies, so neither policy inference nor policy ensembles are required.

Distributed Deep Recurrent Q-Network is implemented based on the changes to Deep Q-Networks suggested by Foerster et al in [13] for multi-agent environments. Due to the nature of this simulation, agents do not share weights directly (i.e. by tying all network weights). Instead they share weights via communication: each agent updates its network parameters in turn and then communicates the updated weights to the next agent, until all agents have performed their updates.
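
The turn-based weight sharing described above can be sketched as follows. The `Agent` class, the plain gradient step and the final broadcast of the finished weights back to every agent are assumptions made for the illustration. The repository's agents hold neural networks rather than short lists of floats.

```python
class Agent:
    """Toy agent whose 'network' is a flat list of weights."""

    def __init__(self, name, n_weights=2):
        self.name = name
        self.weights = [0.0] * n_weights

    def receive(self, weights):
        # Copy so that agents never alias each other's weight lists
        self.weights = list(weights)

    def update(self, gradient, lr=0.1):
        # Stand-in for one learning step on this agent's own experience
        self.weights = [w - lr * g for w, g in zip(self.weights, gradient)]
        return self.weights


def share_weights_in_turn(agents, gradients):
    """Each agent updates the shared weights, then passes them to the next."""
    weights = agents[0].weights
    for agent, gradient in zip(agents, gradients):
        agent.receive(weights)
        weights = agent.update(gradient)
    # Assumption: the final weights are sent back so all agents end up identical
    for agent in agents:
        agent.receive(weights)
    return list(weights)


agents = [Agent("a"), Agent("b")]
final = share_weights_in_turn(agents, [[1.0, 0.0], [0.0, 1.0]])
```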

Algorithm I/O

| Algorithm | State space | Action space |
| --- | --- | --- |
| Q-Learning | Discrete | Discrete |
| DQN | Continuous/Discrete | Discrete |
| DRQN | Continuous/Discrete | Discrete |
| PG | Continuous/Discrete | Discrete |
| A2C | Continuous/Discrete | Discrete |
| DDPG | Continuous/Discrete | Continuous |
| MAAC | Continuous/Discrete | Discrete |
| DDRQN | Continuous/Discrete | Discrete |

Included Environments

Several of OpenAI Gym's environments are included as single-agent environments, as well as some custom environments which have both single-agent and multi-agent variations.

A simple OpenAI Gym maze environment written by GitHub user 'MattChanTK' in [14]. A patch that adds multi-agent functionality to this environment is included in this repository.

An OpenAI Gym maze environment written by GitHub user 'finn1y' in [15].

Cart Pole, Acrobot, Mountain Car, Mountain Car Continuous and Pendulum are each OpenAI Gym environments shipped with OpenAI Gym under the classic control environments.

Environment I/O

| Environment | State space | Action space |
| --- | --- | --- |
| Maze | Discrete | Discrete |
| Robot Maze | Continuous | Discrete |
| Cart Pole | Continuous | Discrete |
| Acrobot | Continuous | Discrete |
| Mountain Car | Continuous | Discrete |
| Mountain Car Continuous | Continuous | Continuous |
| Pendulum | Continuous | Continuous |
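
Read together, the algorithm and environment I/O tables determine which pairings are possible. The sketch below encodes both tables as dictionaries and checks a pairing. The dictionary and function names are hypothetical, and the check follows the tables literally, so it does not account for any discretisation of continuous spaces.

```python
# Supported (state spaces, action spaces) per algorithm, from the Algorithm I/O table
ANY = {"continuous", "discrete"}
ALGORITHM_IO = {
    "Q-Learning": ({"discrete"}, {"discrete"}),
    "DQN": (ANY, {"discrete"}),
    "DRQN": (ANY, {"discrete"}),
    "PG": (ANY, {"discrete"}),
    "A2C": (ANY, {"discrete"}),
    "DDPG": (ANY, {"continuous"}),
    "MAAC": (ANY, {"discrete"}),
    "DDRQN": (ANY, {"discrete"}),
}

# (state space, action space) per environment, from the Environment I/O table
ENVIRONMENT_IO = {
    "Maze": ("discrete", "discrete"),
    "Robot Maze": ("continuous", "discrete"),
    "Cart Pole": ("continuous", "discrete"),
    "Acrobot": ("continuous", "discrete"),
    "Mountain Car": ("continuous", "discrete"),
    "Mountain Car Continuous": ("continuous", "continuous"),
    "Pendulum": ("continuous", "continuous"),
}


def is_compatible(algorithm, environment):
    """True if the algorithm supports the environment's state and action spaces."""
    states, actions = ALGORITHM_IO[algorithm]
    env_state, env_action = ENVIRONMENT_IO[environment]
    return env_state in states and env_action in actions


def compatible_algorithms(environment):
    """List the algorithms that can be paired with the given environment."""
    return [name for name in ALGORITHM_IO if is_compatible(name, environment)]


print(compatible_algorithms("Pendulum"))
```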

Install

  1. Clone the repo
git clone https://github.com/finn1y/RLTrainingEnv
  2. Install Python dependencies in repo
cd RLTrainingEnv
pip install -r requirements.txt
  3. Apply dependency patches, described in patches
  4. Enjoy training some RL agents!

References

[1] G. Brockman, V. Cheung, L. Pettersson et al, "OpenAI Gym", arXiv:1606.01540v1 [cs.LG], 2016. Available: link [Accessed 2 Feb 2022]

[2] D. P. Kingma and J. L. Ba, "Adam: A Method for Stochastic Optimization", arXiv:1412.6980 [cs.LG], 2015. Available: link [Accessed 9 Feb 2022]

[3] P. J. Huber, “Robust Estimation of a Location Parameter”, The Annals of Mathematical Statistics 35(1), 1964, pp. 73-101.

[4] R.S. Sutton and A.G. Barto, Reinforcement Learning: An Introduction, 2nd ed. The MIT Press, 2018.

[5] V. Mnih, K. Kavukcuoglu, D. Silver et al, “Human-level control through deep reinforcement learning”, Nature 518, 2015, pp. 529-533. Available: link [Accessed 2 Feb 2022]

[6] M. Hausknecht and P. Stone, “Deep Recurrent Q-Learning for Partially Observable MDPs”, arXiv:1507.06527v4 [cs.LG], 2017. Available: link [Accessed 2 Feb 2022]

[7] R.S. Sutton, D.A. McAllester, S.P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation”, Advances in neural information processing systems 12, 1999, pp. 1057–1063.

[8] D. Silver, G. Lever, N. Heess et al, “Deterministic policy gradient algorithms”, Proceedings of the 31st International Conference on Machine Learning, 2014, pp. 387–395.

[9] R. J. Williams, “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning”, Machine Learning 8, 1992, pp. 229-256.

[10] S. Bhatnagar, R. Sutton, M. Ghavamzadeh and M. Lee, "Natural Actor-Critic Algorithms", Automatica 45, 2009, pp. 2471-2482.

[11] T.P. Lillicrap, J.J. Hunt, A. Pritzel et al, “Continuous Control with Deep Reinforcement Learning”, arXiv:1509.02971v6 [cs.LG], 2019. Available: link [Accessed 2 Feb 2022]

[12] R. Lowe, Y. Wu, A. Tamar et al, “Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments”, arXiv:1706.02275v4 [cs.LG], 2020. Available: link [Accessed 2 Feb 2022]

[13] J.N. Foerster, Y.M. Assael, N. de Freitas et al, “Learning to Communicate to Solve Riddles with Deep Distributed Recurrent Q-Networks”, arXiv:1602.02672 [cs.AI], 2016. Available: link [Accessed 9 Feb 2022]

[14] M. Chan, "gym-maze", GitHub, 2020. Available: link [Accessed 2 Feb 2022]

[15] F. Middleton-Baird, "gym-robot-maze", GitHub, 2021. Available: link [Accessed 2 Feb 2022]