Reinforcement Learning (RL) Training Environment

A training environment for reinforcement learning algorithms using open-ai gym, as first suggested by Brockman et al in [1]

Included Algorithms

Both single agent and multi-agent reinforcement learning algorithms are included within the learning environment Any single agent algorithm can be used as an independent multi-agent learning algorithm with the multi-agent environments

All algorithms using gradient descent on a neural network use the Adam optimiser, first proposed by Kingma et al in [2]. They also use the Huber loss function in place of the mean squared error where approriate, suggested by Huber in [3]. These functions are implemented as part of the tensorflow library.

All algorithm implementations also include exploration rate (epsilon) and learning rate (alpha) decay where appropriate; this can be removed by setting the relevant decay rate to 1.

Q-Learning

Q-learning is implemented based on the algorithm described by Sutton and Barto in [4].

Deep Q-Network (DQN)

Deep Q-network is implemeneted based on the algorithm described by Minh et al in [5]. However it does not use CNNs as the environments used in this training are not array based (i.e. not an RGB array screen representation)

Deep Recurrent Q-Network (DRQN)

Deep Recurrent Q-Network is implemented based on the alterations to DQN as suggested by Hausknecht and Stone in [6]. Similarly to DQN CNNs are not used. (Note this algorithm is implemented as DQN with a DRQN flag to change the first neural net layer)

Policy Gradient (PG)

Policy Gradient is implemented using the policy gradient equation as derived by Sutton in [7] and its counterpart in [8] by Silver et al. The algorithm is similar to the REINFORCE algorithm as suggested by Williams in [9]

Advantage Actor Critic (A2C)

Advantage Actor Critic is implemented based on one of the actor critic variations suggested by Bhatnagar et al in [10].

Deep Deterministic Policy Gradient (DDPG)

Deep Deterministic Policy Gradient is implemented based on the algorithm as suggested in [11] by Lillicrap et al.

Multi-Agent Actor Critic (MA Actor Critic)

Multi-Agent Actor Critic is implemented based on a the algorithm described by Lowe et al in [12]. As the multi-agent environments are cooperative there is communication of agent policy so no policy inference is required nor are policy ensembles.

Distributed Deep Recurrent Q-Network (DDRQN)

Distributed Deep Recurrent Q-Network is implemented based on the changes to Deep Q-Networks suggested by Foerster et al in [13] for multi-agent environments. Due to the nature of this simulation instead of direct inter-agent weight sharing (i.e. directly tying all network weights) agents share weights via communication each updating the their network parameters in turn and then communicating the updated weights to the next agent until all agents have performed their updates.

Algoithm I/O

Algorithm	State space	Action space
Q-Learning	Discrete	Discrete
DQN	Continuous/Discrete	Discrete
DRQN	Continuous/Discrete	Discrete
PG	Continuous/Discrete	Discrete
A2C	Continuous/Discrete	Discrete
DDPG	Continuous/Discrete	Continuous
MAAC	Continuous/Discrete	Discrete
DDRQN	Continuous/Discrete	Discrete

Included Environments

Several of openai gyms' environments are included as single agent environments are included, as well as some custom environments which have both single agent and multi-agent variations

Maze

A simple openai gym maze environment written by GitHub user 'MattChanTK' in [14]. A patch which adds multi-agent functionality to the environment has been included in this repository for this environment.