Skip to content

Latest commit

 

History

History
executable file
·
381 lines (269 loc) · 26.3 KB

File metadata and controls

executable file
·
381 lines (269 loc) · 26.3 KB

DRL Toolbox

Overview

This repository contains code reinforcement learning code for solving the Udacity Deep reinforcement Learning projects.

Prerequisites

  • Anaconda
  • Linux with GPU
  • CUDA 10.2

All models are developed in Pytorch

Install

Recreate the Anaconda environment with:
conda env create -f environment.yml

Repository Structure

The code is organized as follows:

  • /agents

    • Agents are responsible for interacting with the environment -- given a state provided from the environment, the agent should be able to choose an action. At the next time step the agent will receive a reward for that action. The goal of the agent is to choose actions in order to maximize the long-term (cumulative) reward.

    • /agents/models:

      • Models are interchangeable components representing an agent's brain. The current implementations of models are parametric and invoke neural networks.

      • The input to a model is a State, and the output is a representation Q(st, at; θ) of the value of the (s, a) pair

      • /agents/models/components:

        • Model components are torch.nn.Modules which are combined to form models. In the context of the DQN model, these components form the featurizer and outputs of the DQN model.
    • /agents/policies:

      • Policies are responsible for choosing the next action given the current state. Policies utilize the parametric models from /agents/models for the purpose of generating actions given a state.
      • The Policy classes are responsible for taking Q(at, st), generated from a Model and outputting an Action
    • /agents/memory:

      • Implements memory buffers for the agent, used for storing experiences sampled from the environment and recalling experiences during learning
  • /simulation

    • The simulation modules are responsible for emulating the the environment, and provide helper functions for training and evaluating an agent. The simulator provides environmental state for the agent, receives an action, and generates the next environmental state along with a scalar reward signal for the agent.
  • /tasks

    • Tasks are the learning tasks (or problems) which we develop DRL agents to solve.
    • /tasks/<task_name>
      • The solution and discussion for a particular task can be found in these folders
  • /tools

    • Contains common tools for solving tasks

Task solutions and reports

Quick navigation

Agent Implementations and explanation

Currently only the Deep Q-Network algorithm is implemented, along with a few extensions to the original algorithm outlined in the Rainbow DQN implementation. Below, we discuss the algorithm at a high level, along with the implemented extensions.

  • Overview

    Q-Learning is a value-based reinforcement learning algorithm which can be used to learn an optimal policy (strategy) for selection actions in an environment. Being value based, the Q-learning algorithm aims to identify the expected return (cumulative sum of future rewards):

    Q : S x A -> ℝ, for all s ∈ S, a ∈ A

    Qπ(st, at) = ℰ[Rt, γRt+1, γ2Rt+2 + ... | st, at]

    Where π is the policy being followed, t is the current time step, and Rt is the reward (provided by the environment) at time t.

    The goal in Q-learning is to find the optimal action-value function, or:

    Q*(st, at) = maxπ ℰ[Rt, γRt+1, γ2Rt+2 + ... | st, at]

    The optimal action value function thus the maximum expected return which can be achieved by taking action a in state s.

    The optimal action-value function obeys the Bellman equation, which allows Q*(st, at) to be writting as a recurrence relation:

    Q(st, at) = ℰ[Rt + γ maxat+1 Q(st+1, at+1) | st, at]

    Which can then be cast as an iterative update rule as:

    Qi+1(st, at) = ℰst+1[Rt + γ maxat+1 Qi(st+1, at+1) | st, at]

    Where i indicates the update step. The Deep Q-Network Algorithm uses parametric (neural network) models as function approximators to estimate the value of Q(s, a), resulting in iterative loss equations of:

    Lii) = ℰst,at,rt [ℰst+1([Rt + γ maxat+1 Qi(st+1, at+1; θ-)] - Q(st, at; θi))2]

    Where the target network's weights θ- represents a set of parameters which are periodically copied (synced) with teh online network's weights, θ, every τ time steps, and held fixed otherwise. For convenience, we can write the target value function as:

    YtDQN ≡ Rt + γ maxat+1 Q(st+1, at+1; θt-)

    The double DQN algorithm is a minor but significant adaptation to the DQN algorithm which helps reduce over-estimation of action values Q(s, a). The traditional DQN algorithm has:

    YtDQN ≡ Rt + γ maxat+1 Q(st+1, at+1; θt-) = Rt + γQ(St+1, argmaxaQ(St+1, a;θt); θt)

    where the same θt is used to compute both the current state-values and next state-values, where we've used the fact that at+1 is approximated by argmaxa(Q(St+1, a;θt). Because the same parametric model is used to both select and to evaluate an action, the Double DQN authors show that over-estimation of value estimates can occur.

    The double DQN Algorithm was shown to significantly reduce over-estimation bias in action-value estimates by using the target network (parametrized by θ-) to evaluate the action, while using the online network (parametrized by θ) for the action selection.

    YtDouble-DQN ≡ Rt + γQ(st+1, argmaxa Q(st+1, a; θt) θt-)

    See the compute_errors method of the Base Policy class for code implementation

    Rather than performing learning updates on experiences as they are sampled from the environment (i.e. sequentially through time), the DQN algorithm uses an experience buffer to store experiences, which are then sampled in a probabilistic way to break correlations between sequential experiences. In this implementation, experiences are stored as tuples of (st, at, rt, st+1) corresponding to the state, action, reward and next_state, respectively.

    During learning, experiences tuples (st, at, rt, st+1) are drawn non-uniformly from the replay buffer according according to the error between the predicted and expected action-value for the given experience. Experience samples which result in greater error are given a higher probability for being sampled.

    Samples are added to the replay buffer such that the probability of being sampled is:

    P(i) = piα / ∑k pkα

    Where pi > 0 is the priority of sample i, and α is used to control how much the priotiziation is used, with α=0 representing the case of uniform sampling.

    To determine the priority of an experience tuple, TD Error is used (and can be applied to either the DQN or Double-DQN case). We show the case for the Double-DQN, since that is what's implemented in the code:

    TD Error = Rt + γQ(st+1, argmaxa Q(st+1, a; θt) θt-) - Q(st, at; θ)

    In the code, we use the L2 norm of the TD error as our un-normalized priority values pi.

    Due to the introduced bias from non-uniform sampling of experiences, the gradients are weighted by importance-sampling weights as a correction:

    wi = (1/N * 1/P(i)) β

    A SumTree data structure is implemented to perform weighted sampling efficiently. See the implementation of the PER buffer, and the SumTree.

    See the compute_errors method of the Base Policy class shows where importance weights are applied to scale the gradients, and step method of the DQNAgent contains the implementation of updating the priorities.

    The dueling DQN network architecture attempts to decompose Q(s, a) as the sum of the value, V(s) of a state, and the advantage, A(s, a) of taking an action in that state (advantage over all other possible actions from that state).

    Q(s, a) = A(s, a) + V(s)

    To accomplish this, the output-component of the DQN is separated into two steams, one for V(s) and one for A(s, a). These two streams are then aggregated to obtain the final Q(s, a) values.

    By decoupling, the Dueling DQN learns the value of a state independently from the value of actions from that state, which is particularly useful if actions to not affect the value of a state in a relevant way.

    For implementation details, see the _get_dueling_output method of the dqn.

    The Noisy DQN output replaces the traditional ε-greedy exploration technique for a neural-network layer whose weights and biases are perturbed by a parametric function which is learned with the rest of the network. Noisy networks have the advantage over ε-greedy in that the amount of noise they inject into the network is learned rather than annealed in a heuristic manner.

    For implementation details, see the get_output method of the dqn.

    The categorical DQN algorithm attempts to model the return distribution for an action, rather than the expected return, thus modelling the distribution of Q(s, a). The categorical DQN is implemented in the get_output method of dqn, with corresponding categorical policy which is responsible for computing the errors between the target and online network distributions. Please refer to the paper for theoretical details and to this reference implementation, from which the code is adapted from.

  • Overview

    DDPG shares many commonalities with the DQN network, but is designed to accommodate continuous action spaces. DDPG displays an actor-critic architecture, where the critic samples mini-batches of experience from a replay buffer to calculate a bootstrapped TD target for training the Q-function off-policy, and the actor is trained on-policy to learn the optimal action to take in a given state.

    The major difference between the DQN and DDPG can be explained as follows. The DQN loss function at each iteration can be describes as:

    Ltt) = ℰ(s,a,r,s')[(r + γ Q(s', argmaxa'Q(s', a'; θ-); θ- - Q(s, a; θt)))2]

    which uses the argmax of the q-function of the next state to obtain the greedy action.

    In contrast, the DDPG loss at time step t is given by:

    Ltt) = ℰ(s,a,r,s')[(r + γ Q(s', μQ(s'; φ-); θ- - Q(s, a; θt)))2]

    Where the argmax has been replaced by a learned policy function μ parametrized by φ, which learns the deterministic greedy action from the current state. Note that both actor and critic networks use the target-network approach as in DQN.

    The actor/policy network μ(-; φt) is trained to find the action which maximizes the expected q-value, where the loss is given by:

    Jii) = ℰs [Q(s, μ(s; φ);θ)]

    Where the states s are sampled from the replay buffer, and in practice are the same states used to update the critic network.

    To allow for exploration with deterministic policies, we inject Gaussian noise into the actions selected by the policy.

    Prioritized experience replay

    As in DQN, we can replace the uniformly-sampled replay buffer with a prioritized replay buffer

    The TD3 paper introduces a number of improvements to the classic TD3 algorithm which improve performance, given below:

    • Double Learning/ twin loss function
      • This improvement splits the critic model into two separate and independent (unless weight sharing is used) networks, where only the optimizer is shared, creating two separate estimates for the value of the state-action pair. The joint loss function is computed as the sum of the MSEs of each of the two streams:

        Jtwin, t = Jtta) + Jttb)

        for the two streams a and b, with:

        Jtta) = ℰs, a, r, s'[(TWIN target - Q(s, a; θta))2]

        Jttb) = ℰs, a, r, s'[(TWIN target - Q(s, a; θtb))2]

        TWIN target = r + γ minn Q(s', μ(s';φ-); θn, -)

        where we use the target q-value is the minimum q-value obtained from the two streams, using target networks for both policy and value network.

    • Smoothing targets for policy updates
      • In DDPG, Gaussian noise was added to the actions used for exploration. In TD3, this is extended by adding noise to the actions used to calculate the targets, which can be seen as regularization as the network is forced to learn similarities between actions, and helps especially during the beginning of learning where it is most likely for the network to converge to incorrect actions.

      a',smooth = clamp(μ(s';φ-)) + clamp(ε, εlow, εhigh), actionlow, actionhigh)

      where clamp(x, l, h) = max(min(x, h), l) is the clamping function, ε is an array of Gaussian noise, and εlow/high and actionlow/high represent the minimum/maximum of the sampled noise and actions, respectively.

      The smooth TD target is then:

      TD3target = r + γminnQ(s', a', smooth; θn, -)

    • Delaying policy updates
      • The last TD3 improvement is to update the online Q-function at a higher frequency than both the policy network updates and the soft-copying of target networks -> online networks. Delaying updates in this manner allows the value function to stabilize into more accurate values prior to passing it to the policy network. The typical delay parameter used is τ=2, meaning that the policy and target networks are updated every other update compared to the online Q-function.
  • Overview

    The MADDPG algorithm attempts to address the challenges of implementing single-agent algorithms in a multi-agent setting, which generally results in poor performance due to 1) the non stationarity of the environment due to the presence of other agents, and 2) the exponential increase in state/action spaces resulting from modeling multiple interacting agents (due to the curse of dimensionality). The MADDPG algorithm is an example of a class of algorithms referred to as "centralized planning with decentralized execution".

    Centralized Planning

    The concept of centralized planning with decentralized execution involves having agents which only have access to local (i.e. their own) observations, and during training all agent's policy updates are guided by the same (centralized) critic, while during inference the centralized critic is removed and agents are guided only by their independently learned policies. The centralized critic used during training encompasses information from all agents, alleviating the non-stationarity of the environment.

    Decentralized execution

    During inference the centralized critic module is removed, forcing each agent to make decisions guided by local observations and their individual policies.

    MADDPG Architecture

    In the MADDPG framework, there are multiple agents, where each agent has a discrete observation space and a continuous action space. Each agent has an online-actor, target-actor, and a critic which uses global information (states and actions from all agents). If the agents are Homogeneous, the actor/target networks can be shared amongst agents

    Learning algorithm

    As in DDPG, MADDPG learns off-policy using an experience replay buffer, where experience is stored as a tuple of all agents a1, ..., aN, state information x, x' required for the update of the critic network, and the agent's rewards r1, ..., rN

    (x, x', a1, ..., aN, r1, ..., rN )

    Critic update

    In general, there may be N independent policies μθi yielding a gradient of:

    ∇θi J(μi) = ℰ x, a ~ D [∇θi μi(ai | oi) ∇θai Qiμ (x, a1, ..., an | ai = μi(oi))]

    Where x, a are sampled from the replay buffer. The centralized action-value function Qi μ is then updated as:

    L(θi) = ℰx, a, r, x' [(Qiμ (x, a1, ..., an) - y) 2],

    where y = ri + γ Qiμ' (x', a1, ..., an)

    Actor update

    The actor update for MADDPG is the same as for DDPG, where each agent's actor μi is updated as:

    θi = ℰ x, a ~ D = [∇θi μi(ai | oi) ∇aiQiμ (x, a1, ..., an) | | ai = μi(oi)]

    The actor's gradient, having access only to the agent's local observations, is scaled by the gradient of the critic which has access to global information, reducing the effects of non-stationarity

    Policy inference

    Rather than explicitly passing in the

    MATD3 Extensions

    We can apply the same extensions which take DDPG to TD3 to the multi-agent case, resulting in better algorithm convergence as in the MATD3 paper

  • Overview

    The PPO method was introduced in 2017 and is an on-policy learning algorithm which alternates between sampling data from the environment and optimizing a surrogate objective function by gradient descent, in contrast to other policy gradient methods which perform a gradient update on each sample before discarding the sample. The PPO algorithm has experimentally been shown to offer better sampling complexity than other policy gradient algorithms, while showing robustness and simplicity over other families of algorithms such as q-learning.

    Clipped Surrogate Objective Function

    The PPO algorithm borrows the surrogate objective function from the Trust Region Policy Optimization (TRPO) algorithm, which states:

    LCPI(θ) = ℰt[At πθ(at | st) / πθ(at | st) = ℰt[rt(θ) At]

    The clipped surrogate objective function that is proposed in PPO is:

    LCLIP(θ) = ℰt[min(rt(θ)At, clip(rt(θ), 1 - ε, 1 + ε) At]

    where ε is a small constant, such as ε=0.02. The first term is the same as LCPI, while the second term clips rt to be within the interval [1 - ε, 1 + ε]. By taking the minimum of the unclipped and clipped objective, we obtain a lower bound for the final objective function which improves the stability of the algorithm.

    During training the policy function is typically guided by a state-value function V(s), which is included in the optimization function as

    LVFt = (Vθ(st) - Vttarg)2

    and an additional entropy term of the policy distribution which helps to ensure sufficient exploration,

    Lentropy = S[π] (st)

    Resulting in a final loss function of Lt = LtCLIP(θ) + c1LVFt + c2Ltentropy

    where c1 and c2 are constants.

    MAPPO

    The PPO algorithm above can be extended to the multi-agent scenario in an analagous way as DDPG to MADDPG. This involves passing the state and actions of all other agents in the environment to the (joint_state, joint_actions) to the Critic of each agent during training, whos value estimate will assist in guiding the learning of the policy (Actor) network. During evaluation, only the policy network is used, and the agents are not provided any external information regarding other agents.