
Commit

Minor refactoring, renaming, and resolution of bad paths in readme/reports
lajd committed Sep 27, 2020
1 parent 143c4c1 commit 6a917f3
Showing 18 changed files with 153 additions and 66 deletions.
73 changes: 56 additions & 17 deletions readme.md → README.md
@@ -2,7 +2,8 @@

## Overview
This repository contains reinforcement learning code for solving the
Udacity Deep Reinforcement Learning projects.
Udacity Deep Reinforcement Learning projects. It has been refactored so that
the implemented algorithms can be applied to new environments with minimal setup.

## Prerequisites
- Anaconda
@@ -14,6 +15,9 @@ All models are developed in PyTorch
Recreate the Anaconda environment with: <br/>
`conda env create -f environment.yml`

Activate the conda environment with: <br/>
`conda activate drl_toolbox`

## Repository Structure
The code is organized as follows:

@@ -63,21 +67,56 @@ The code is organized as follows:
- [Task/environment Details](tasks/banana_collector/TASK_DETAILS.md)
- [REPORT.md](tasks/banana_collector/solutions/ray_tracing_banana/REPORT.md)
- [RESULTS.pdf](tasks/banana_collector/solutions/ray_tracing_banana/RESULTS.pdf)
- [Train](tasks/banana_collector/solutions/ray_tracing_banana/banana_solution_train.py)
- [Eval](tasks/banana_collector/solutions/ray_tracing_banana/banana_solution_eval.py)
- Visual (pixel) implementation
- [Train DQN](tasks/banana_collector/solutions/ray_tracing_banana/banana_solution_train.py)
- [Eval DQN](tasks/banana_collector/solutions/ray_tracing_banana/banana_solution_eval.py)
<br/>
<img src="https://github.com/lajd/drl_toolbox/blob/master/tasks/banana_collector/solutions/ray_tracing_banana/solution_checkpoint/trained_banana_agent.gif?raw=true" width="350" height="200" />
<br/>

- Visual Banana (pixel) implementation
- [Task/environment Details](tasks/banana_collector/TASK_DETAILS.md)
- [REPORT.md](tasks/banana_collector/solutions/pixel_banana/REPORT.md)
- [Train](tasks/banana_collector/solutions/pixel_banana/banana_visual_solution_train.py)
- [Eval](tasks/banana_collector/solutions/pixel_banana/banana_visual_solution_train.py)
- Reacher continuous control (20-agent) implementation
- [Task/environment Details](tasks/reacher_continuous_control/TASK_DETAILS.md)
- [REPORT.md](tasks/reacher_continuous_control/solutions/ddpg/REPORT.md)
- [Train DDPG](tasks/reacher_continuous_control/solutions/ddpg/train_ddpg_baseline.py)
- [Eval DDPG](tasks/reacher_continuous_control/solutions/ddpg/eval_ddpg_baseline.py)
- [Train TD3](tasks/reacher_continuous_control/solutions/ddpg/train_ddpg_baseline.py)
- [Eval TD3](tasks/reacher_continuous_control/solutions/ddpg/eval_td3_baseline.py)
- [Train DQN](tasks/banana_collector/solutions/pixel_banana/banana_visual_solution_train.py)
- [Eval DQN](tasks/banana_collector/solutions/pixel_banana/banana_visual_solution_train.py)
<br/>
<img src="https://github.com/lajd/drl_toolbox/blob/master/tasks/banana_collector/solutions/pixel_banana/solution_checkpoint/trainined_visual_banana_agent.gif?raw=true" width="350" height="200" />
<br/>

- Reacher (20 homogeneous agents) implementation
- [Task/environment Details](tasks/reacher/TASK_DETAILS.md)
- [REPORT.md](tasks/reacher/solutions/ddpg/REPORT.md)
- [Train TD3](tasks/reacher/solutions/ddpg/train_td3_per.py)
- [Eval TD3](tasks/reacher/solutions/ddpg/eval_td3_per.py)
<br/>
<img src="https://github.com/lajd/drl_toolbox/blob/master/tasks/reacher/solutions/ddpg/solution_checkpoint/trained_reacher_agent.gif?raw=true" width="350" height="200" />
<br/>
- Crawler (12 homogeneous agents) implementation
- [Task/environment Details](tasks/crawler/TASK_DETAILS.md)
- [REPORT.md](tasks/crawler/solutions/ppo/REPORT.md)
- [Train PPO](tasks/crawler/solutions/ppo/train_ppo.py)
- [Eval PPO](tasks/crawler/solutions/ppo/eval_ppo.py)
<br/>
<img src="https://github.com/lajd/drl_toolbox/blob/master/tasks/crawler/solutions/ppo/solution_checkpoint/trained_crawler_agent.gif?raw=true" width="350" height="200" />
<br/>

- Soccer (multi-agent) implementation
- [Task/environment Details](tasks/soccer/TASK_DETAILS.md)
- [REPORT.md](tasks/soccer/REPORT.md)
- [Train MAPPO](tasks/soccer/solutions/mappo/train_mappo.py)
- [Eval MAPPO](tasks/soccer/solutions/mappo/eval_mappo.py)
<br/>
<img src="https://github.com/lajd/drl_toolbox/blob/master/tasks/soccer/solutions/mappo/solution_checkpoint/trained_soccer_agent.gif?raw=true" width="350" height="200" />
<br/>


- Tennis (multi-agent) implementation
- [Task/environment Details](tasks/tennis/TASK_DETAILS.md)
- [REPORT.md](tasks/tennis/REPORT.md)
- [Train MAPPO](tasks/tennis/solutions/mappo/train_mappo.py)
- [Eval MAPPO](tasks/tennis/solutions/mappo/eval_mappo.py)
<br/>
<img src="https://github.com/lajd/drl_toolbox/blob/master/tasks/tennis/solutions/mappo/solution_checkpoint/trained_tennis_agent.gif?raw=true" width="350" height="200" />
<br/>

## Agent Implementations and explanation
Currently only the [Deep Q-Network](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf) algorithm is implemented, along with
@@ -135,7 +174,7 @@ Below, we discuss the algorithm at a high level, along with the implemented exte
Y<sub>t</sub><sup>Double-DQN</sup> &equiv; R<sub>t</sub> + &gamma;Q(s<sub>t+1</sub>, argmax<sub>a</sub> Q(s<sub>t+1</sub>, a; &theta;<sub>t</sub>); &theta;<sub>t</sub><sup>-</sup>)
See the `compute_errors` method of the [Base Policy](agents/policies/base.py) class for code implementation
See the `compute_errors` method of the [Base Policy](agents/policies/base_policy.py) class for code implementation
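For orientation, a minimal sketch of the Double-DQN target is given below. It is written against generic PyTorch modules (`online_net`, `target_net`) rather than this repository's classes, so the names are illustrative only.

```python
import torch

def double_dqn_target(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Sketch of the Double-DQN target: the online network selects the greedy
    next action and the target network evaluates it."""
    with torch.no_grad():
        # argmax_a Q(s_{t+1}, a; theta_t): action selection by the online network
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # Q(s_{t+1}, a*; theta_t^-): evaluation by the target network
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        # Bootstrapping is disabled on terminal transitions
        return rewards + gamma * next_q * (1.0 - dones.float())
```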

###### [Prioritized Experience Replay (PER)](https://arxiv.org/abs/1511.05952)
Rather than performing learning updates on experiences as they are sampled from the environment (i.e. sequentially through time), the DQN
@@ -169,7 +208,7 @@ Below, we discuss the algorithm at a high level, along with the implemented exte
A SumTree data structure is implemented to perform weighted sampling efficiently. See the implementation
of the [PER buffer](agents/memory/prioritized_memory.py), and the [SumTree](tools/data_structures/sumtree.py).
See the `compute_errors` method of the [Base Policy](agents/policies/base.py) class shows where importance weights
The `compute_errors` method of the [Base Policy](agents/policies/base_policy.py) class shows where importance weights
are applied to scale the gradients, and the `step` method of the [DQNAgent](agents/dqn_agent.py) contains the implementation
of updating the priorities.
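As an illustration of the proportional sampling described above, here is a compact sum-tree sketch; it is not the repository's implementation, just a minimal version of the same idea.

```python
import numpy as np

class SumTree:
    """Minimal sum-tree: leaves store priorities, internal nodes store sums,
    so sampling proportional to priority costs O(log n)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)      # internal nodes followed by leaves

    @property
    def total(self):
        return self.tree[0]                         # root holds the sum of all priorities

    def update(self, leaf, priority):
        idx = leaf + self.capacity - 1
        change = priority - self.tree[idx]
        self.tree[idx] = priority
        while idx != 0:                             # propagate the change up to the root
            idx = (idx - 1) // 2
            self.tree[idx] += change

    def sample(self, value):
        """Walk down from the root, picking the child whose cumulative sum covers `value`."""
        idx = 0
        while 2 * idx + 1 < len(self.tree):         # stop at a leaf
            left, right = 2 * idx + 1, 2 * idx + 2
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = right
        return idx - (self.capacity - 1), self.tree[idx]   # (leaf index, priority)
```

Sampling a batch then amounts to drawing `value` uniformly from `[0, tree.total)` for each element, and the resulting sampling probabilities are what the importance weights that scale the gradients are computed from.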
@@ -199,7 +238,7 @@ Below, we discuss the algorithm at a high level, along with the implemented exte
###### [Distributional (Categorical) DQN network](https://arxiv.org/abs/1707.06887)
The categorical DQN algorithm attempts to model the `return distribution` for an action, rather than the
`expected return`, thus modelling the distribution of Q(s, a). The categorical DQN is implemented in
the `get_output` method of [dqn](agents/models/dqn.py), with corresponding [categorical policy](agents/policies/categorical.py)
the `get_output` method of [dqn](agents/models/dqn.py), with corresponding [categorical policy](agents/policies/categorical_policy.py)
which is responsible for computing the errors between the target and online network distributions. Please refer
to the paper for theoretical details and to this [reference implementation](https://github.com/higgsfield/RL-Adventure/blob/master/7.rainbow%20dqn.ipynb),
from which the code is adapted.
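As a rough sketch (not this repository's code) of how a categorical head is reduced to scalar Q-values, assuming the common C51-style support of `n_atoms` atoms between `v_min` and `v_max`:

```python
import torch
import torch.nn.functional as F

def categorical_q_values(atom_logits, v_min=-10.0, v_max=10.0, n_atoms=51):
    """atom_logits: (batch, n_actions, n_atoms) raw network outputs."""
    support = torch.linspace(v_min, v_max, n_atoms)   # fixed return atoms z_1 ... z_N
    probs = F.softmax(atom_logits, dim=-1)            # p(z_i | s, a) per action
    q_values = (probs * support).sum(dim=-1)          # E[Z(s, a)] used for action selection
    return q_values, probs
```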
@@ -374,7 +413,7 @@ Below, we discuss the algorithm at a high level, along with the implemented exte
where c1 and c2 are constants.
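To make the roles of c1 and c2 concrete, here is a hedged sketch of the combined PPO loss; the tensor names are illustrative, not this repository's exact implementation.

```python
import torch

def ppo_loss(new_log_probs, old_log_probs, advantages, values, returns, entropy,
             clip_eps=0.2, c1=0.5, c2=0.01):
    """Clipped surrogate objective, written as a loss to minimize:
    -L_clip + c1 * L_value - c2 * entropy_bonus."""
    ratio = (new_log_probs - old_log_probs).exp()            # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()      # pessimistic (clipped) surrogate
    value_loss = (returns - values).pow(2).mean()            # critic regression, weighted by c1
    entropy_bonus = entropy.mean()                           # exploration term, weighted by c2
    return policy_loss + c1 * value_loss - c2 * entropy_bonus
```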

###### MAPPO
The PPO algorithm above can be extended to the multi-agent scenario in an analagous way as DDPG to MADDPG. This
The PPO algorithm above can be extended to the multi-agent scenario in an analogous way as DDPG to MADDPG. This
involves passing the states and actions of all other agents in the environment (the joint_state and joint_actions)
to the Critic of each agent during training, whose value estimate assists in guiding the learning of the policy (Actor)
network. During evaluation, only the policy network is used, and the agents are not provided any external information regarding
19 changes: 0 additions & 19 deletions agents/models/ppo.py
@@ -2,7 +2,6 @@
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional
import numpy as np
from tools.misc import set_seed


@@ -25,15 +24,6 @@ def step_episode(self):
pass

def forward(self, state, action=None, scale=1, min_std=0.05, *args, **kargs):
"""Build Policy.
Returns
======
action (Tensor): predicted action or inputed action
log_prob (Tensor): log probability of current action distribution
ent (Tensor): entropy of current action distribution
value (Tensor): estimate value function
"""
assert min_std >= 0 and scale >= 0
if self.continuous_actions:
action_mean = self.actor(state)
@@ -73,15 +63,6 @@ def step_episode(self):

def forward(self, agent_state: torch.FloatTensor, other_agent_states: torch.FloatTensor,
other_agent_actions: Optional[torch.FloatTensor] = None, action: Optional[torch.FloatTensor] = None, min_std=0.05, scale=1,):
"""Build Policy.
Returns
======
action (Tensor): predicted action or inputed action
log_prob (Tensor): log probability of current action distribution
ent (Tensor): entropy of current action distribution
value (Tensor): estimate value function
"""
assert min_std > 0 and scale >= 0, (min_std, scale)

if self.continuous_actions:
2 changes: 1 addition & 1 deletion tasks/banana_collector/solutions/pixel_banana/REPORT.md
@@ -127,4 +127,4 @@ Value channel: </br>
![value][image5]

Basic experiments were performed with the above dimensionality techniques, which can be found in [tools](../../../../tools/image_utils.py),
however the network has signfiicant difficulty learning from them (at least in the constraints imposed by the memory issue).
however the network has significant difficulty learning from them (at least in the constraints imposed by the memory issue).
1 change: 1 addition & 0 deletions tasks/crawler/TASK_DETAILS.md
@@ -1,3 +1,4 @@
[image2]: https://user-images.githubusercontent.com/10624937/43851646-d899bf20-9b00-11e8-858c-29b5c2c94ccc.png "Crawler"

### (Optional) Challenge: Crawler Environment

82 changes: 82 additions & 0 deletions tasks/crawler/solutions/ppo/REPORT.md
@@ -0,0 +1,82 @@
[scores]: solution_checkpoint/ppo_training_scores.png "PPO Baseline Results"

# Crawler
Please see the [repository overview](../../../../README.md) as well as the [task description](../../TASK_DETAILS.md)
before reading this report. The theoretical details of the utilized algorithms can be found in the [repository overview](../../../../README.md).

In this task there are 12 crawler agents whose goal is to reach a static location in the environment as fast as possible
(i.e. minimize falling and maximize speed).

<img src="https://github.com/lajd/drl_toolbox/blob/master/tasks/crawler/solutions/ppo/solution_checkpoint/trained_crawler_agent.gif?raw=true" width="400" height="250" />

# Solution Overview

The solutions discussed in this report rely on the PPO algorithm. All 12 agents share the same PPO brain
(actor-critic and optimizer) and the same shared trajectory buffer. During training, the agents perform
batch learning by sampling from this shared trajectory buffer. After a small number of learning epochs, the experience
samples are discarded.
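A rough sketch of this loop is given below; `env`, `brain`, and `trajectory_buffer` are placeholder interfaces rather than this repository's actual classes, and the number of learning epochs is illustrative (the other defaults mirror the hyper-parameters listed further down).

```python
def train_shared_ppo(env, brain, trajectory_buffer, num_episodes=3000, max_t=2000,
                     learning_epochs=4, batch_size=1024):
    """Illustrative training loop: one PPO actor-critic drives all homogeneous agents."""
    for _ in range(num_episodes):
        states = env.reset()                                  # one observation per agent
        for _ in range(max_t):
            actions, log_probs, values = brain.act(states)    # shared actor-critic for all agents
            next_states, rewards, dones = env.step(actions)
            trajectory_buffer.add(states, actions, log_probs, values, rewards, dones)
            states = next_states
            if all(dones):
                break
        # A few epochs of minibatch updates on the shared on-policy buffer ...
        for _ in range(learning_epochs):
            for batch in trajectory_buffer.minibatches(batch_size):
                brain.learn(batch)                            # clipped PPO update
        trajectory_buffer.clear()                             # ... then discard the samples
```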

The actor-critic architecture has the following form:

```
PPO_Actor_Critic(
(actor): MLP(
(mlp_layers): Sequential(
(0): BatchNorm1d(129, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(1): Linear(in_features=129, out_features=128, bias=True)
(2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): LeakyReLU(negative_slope=True)
(4): Linear(in_features=128, out_features=128, bias=True)
(5): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(6): LeakyReLU(negative_slope=True)
(7): Linear(in_features=128, out_features=20, bias=True)
(8): Tanh()
)
)
(critic): MLP(
(mlp_layers): Sequential(
(0): BatchNorm1d(129, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(1): Linear(in_features=129, out_features=128, bias=True)
(2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): LeakyReLU(negative_slope=True)
(4): Linear(in_features=128, out_features=128, bias=True)
(5): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(6): LeakyReLU(negative_slope=True)
(7): Linear(in_features=128, out_features=1, bias=True)
(8): Tanh()
)
)
)
```
The model hyper-parameters are given below:

```
NUM_EPISODES = 3000
SEED = 8
MAX_T = 2000
WEIGHT_DECAY = 1e-4
EPSILON = 1e-5 # epsilon of Adam
LR = 1e-4 # learning rate of the actor-critic
BATCH_SIZE = 1024
DROPOUT = None
BATCHNORM = True
SOLVE_SCORE = 1600
```

## Results

Below we show the plot of mean episode scores (across all agents) versus episode number.

![Training scores][scores]

The environment was solved (mean reward of >=1600) after about 320 episodes.
Training took roughly 1.3 hours.

## Discussion
The PPO algorithm demonstrated good stability and convergence, and was experimentally shown to be rather robust to changes
in hyperparameters.

## Ideas for Future Work
The PPO algorithm demonstrated quick convergence on this task; however, its sample efficiency leaves much to be desired.
To increase sample efficiency, memory replay methods such as [Hindsight Experience Replay (HER)](https://papers.nips.cc/paper/7090-hindsight-experience-replay.pdf) could be implemented,
which help the agent learn from sparse rewards.
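For concreteness, a minimal sketch of HER's "future" relabelling strategy is given below; the `(state, action, next_state, achieved_goal, goal)` tuple layout and `reward_fn` are illustrative assumptions, not code from this repository.

```python
import random

def her_relabel(episode, reward_fn, k=4):
    """episode: list of (state, action, next_state, achieved_goal, goal) tuples.
    Each transition is duplicated k times with a goal that was actually achieved
    later in the episode, and its reward is recomputed against that new goal."""
    relabelled = []
    for t, (state, action, next_state, achieved_goal, _) in enumerate(episode):
        for _ in range(k):
            future = random.randint(t, len(episode) - 1)
            new_goal = episode[future][3]                 # a goal achieved later on
            new_reward = reward_fn(achieved_goal, new_goal)
            relabelled.append((state, action, next_state, new_goal, new_reward))
    return relabelled
```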
2 changes: 1 addition & 1 deletion tasks/crawler/solutions/ppo/train_ppo.py
@@ -18,7 +18,7 @@
MAX_T = 2000
WEIGHT_DECAY = 1e-4
EPSILON = 1e-5 # epsilon of Adam
LR = 1e-4 # learning rate of the actor
LR = 1e-4 # learning rate of the actor-critic
BATCH_SIZE = 1024
DROPOUT = None
BATCHNORM = True
3 changes: 0 additions & 3 deletions tasks/reacher/TASK_DETAILS.md
@@ -1,7 +1,4 @@
[//]: # (Image References)

[image1]: https://user-images.githubusercontent.com/10624937/43851024-320ba930-9aff-11e8-8493-ee547c6af349.gif "Trained Agent"
[image2]: https://user-images.githubusercontent.com/10624937/43851646-d899bf20-9b00-11e8-858c-29b5c2c94ccc.png "Crawler"


# Project 2: Continuous Control
3 changes: 1 addition & 2 deletions tasks/reacher/solutions/ddpg/REPORT.md
@@ -1,8 +1,7 @@
[image1]: https://user-images.githubusercontent.com/10624937/43851024-320ba930-9aff-11e8-8493-ee547c6af349.gif "Trained Agent"
[image2]: resources/ddpg_baseline.png "DDPG Baseline Results"
[image3]: resources/per_td3_baseline.png "TD3 PER Baseline Results"

# Reacher (Continuous Control)
# Reacher
Please see the [repository overview](../../../../README.md) as well as the [task description](../../TASK_DETAILS.md)
before reading this report. The theoretical details of the utilized algorithms can be found in the [repository overview](../../../../README.md).

@@ -2,7 +2,7 @@
import torch
from tasks.reacher.solutions.utils import get_simulator, BRAIN_NAME
from tasks.reacher.solutions.ddpg import SOLUTIONS_CHECKPOINT_DIR
from tasks.reacher.solutions.ddpg.train_td3_baseline import get_solution_brain_set, MAX_T
from tasks.reacher.solutions.ddpg.train_td3_per import get_solution_brain_set, MAX_T

SAVE_TAG = 'per_td3'
ACTOR_CHECKPOINT = os.path.join(SOLUTIONS_CHECKPOINT_DIR, f'{SAVE_TAG}_actor_checkpoint.pth')
10 changes: 2 additions & 8 deletions tasks/soccer/REPORT.md
@@ -1,15 +1,9 @@
[trained_soccer]:https://user-images.githubusercontent.com/10624937/42135622-e55fb586-7d12-11e8-8a54-3c31da15a90a.gif "Soccer"
[mappo_results_image]: solutions/mappo/solution_checkpoint/mappo_100_consecutive_wins_training_scores.png "MAPPO Training"

### Multi Agent Soccer Environment
![Soccer][trained_soccer]




# Soccer MAPPO/MATD3 Introduction
Please see the [repository overview](../../../../README.md) as well as the [task description](../../TASK_DETAILS.md)
before reading this report. The theoretical details of the utilized algorithms can be found in the [repository overview](../../../../README.md).
Please see the [repository overview](../../README.md) as well as the [task description](./TASK_DETAILS.md)
before reading this report. The theoretical details of the utilized algorithms can be found in the [repository overview](../../README.md).

In this environment, two teams (each with a Striker/Goalie agent) compete against each other in the game of soccer. The agents can move laterally
and vertically, and the strikers have the additional action of rotating left/right, resulting in 4 and 6 discrete actions for
4 changes: 2 additions & 2 deletions tasks/soccer/setup_linux.sh
@@ -1,9 +1,9 @@
# Execute this script from the /tasks/reacher_continuous_control directory
# Execute this script from the /tasks/soccer directory
# bash ./setup_linux.sh

mkdir -p environments

# Download the reacher environment
# Download the soccer environment
wget https://s3-us-west-1.amazonaws.com/udacity-drlnd/P3/Soccer/Soccer_Linux.zip --no-check-certificate

unzip Soccer_Linux.zip && mv Soccer_Linux environments/ && rm Soccer_Linux.zip
11 changes: 5 additions & 6 deletions tasks/tennis/REPORT.md
@@ -1,10 +1,9 @@
[trained_tennis_gif]: https://user-images.githubusercontent.com/10624937/42135623-e770e354-7d12-11e8-998d-29fc74429ca2.gif "Trained Agent"
[mappo_results_image]: solutions/mappo/solution_checkpoint/mappo_training_scores.png "MAPPO Training"
[matd3_results_image]: solutions/maddpg/solution_checkpoint/independent_madtd3_training_scores.png "MATD3 Training"

# Tennis MAPPO/MATD3 Introduction
Please see the [repository overview](../../../../README.md) as well as the [task description](./TASK_DETAILS.md)
before reading this report. The theoretical details of the utilized algorithms can be found in the [repository overview](../../../../README.md).
Please see the [repository overview](../../README.md) as well as the [task description](./TASK_DETAILS.md)
before reading this report. The theoretical details of the utilized algorithms can be found in the [repository overview](../../README.md).

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1. If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01. Thus, the goal of each agent is to keep the ball in play.

@@ -21,7 +20,7 @@ The unity environment consists of 2 agents which have separate brains (models/op
but can observe the states and actions of the other agents and use this information during training time.
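As a minimal illustration of this centralized-training / decentralized-execution setup (not this repository's code), each agent's critic can be fed the other agent's state and action alongside its own observation, while the actor only ever sees its own observation:

```python
import torch

def act(actor, agent_state):
    """Execution: each agent's actor sees only its own observation."""
    return actor(agent_state)

def evaluate(critic, agent_state, other_agent_state, other_agent_action):
    """Training: the critic additionally receives the other agent's state and action,
    forming the joint input used to guide each agent's policy updates."""
    joint = torch.cat([agent_state, other_agent_state, other_agent_action], dim=-1)
    return critic(joint)
```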


![Trained Agent][trained_tennis_gif]
<img src="https://github.com/lajd/drl_toolbox/blob/master/tasks/tennis/solutions/mappo/solution_checkpoint/trained_tennis_agent.gif?raw=true" width="700" height="450" />

# Solution Overview

@@ -171,13 +170,13 @@ a score of >0.5 in ~ 2700 episodes (15 minutes), and a score of > 1 in about 320
![Training MATD3 Agent][matd3_results_image]


##### Discussion
## Discussion
The MAPPO algorithm converged *significantly* faster than the MATD3 algorithm, achieving a score of >1 about 33x
faster (20 minutes vs. 11 hours). It should be noted, though, that hyper-parameter tuning
(especially on the MATD3 algorithm) was not conducted due to the long training duration. Overall, this result demonstrates
the robustness of the PPO algorithm to a wide range of tasks.

The MAPPO algorithm, beign on-policy, is shown to be relatively sample inefficient compared to off-policy algorithms such as
The MAPPO algorithm, being on-policy, is shown to be relatively sample inefficient compared to off-policy algorithms such as
MATD3, where MAPPO achieved a score of > 1 after 3200 episodes compared to 800 episodes by MATD3. The MATD3 algorithm takes
advantage of prioritized experience replay (PER) to sample experience based on the amount of information the experience provides, while
the MAPPO algorithm has no such intelligent memory buffer and simply discards trajectories of experience after a few learning epochs.