models/research/pcl_rl at master · jbdatascience/models

Name	Name	Last commit message	Last commit date
parent directory ..
README.md	README.md
baseline.py	baseline.py
controller.py	controller.py
env_spec.py	env_spec.py
expert_paths.py	expert_paths.py
full_episode_objective.py	full_episode_objective.py
gym_wrapper.py	gym_wrapper.py
model.py	model.py
objective.py	objective.py
optimizers.py	optimizers.py
policy.py	policy.py
replay_buffer.py	replay_buffer.py
trainer.py	trainer.py
trust_region.py	trust_region.py

Code for several RL algorithms used in the following papers:

"Improving Policy Gradient by Exploring Under-appreciated Rewards" by Ofir Nachum, Mohammad Norouzi, and Dale Schuurmans.
"Bridging the Gap Between Value and Policy Based Reinforcement Learning" by Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans.
"Trust-PCL: An Off-Policy Trust Region Method for Continuous Control" by Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans.

Available algorithms:

Actor Critic
TRPO
PCL
Unified PCL
Trust-PCL
PCL + Constraint Trust Region (un-published)
REINFORCE
UREX

Requirements:

TensorFlow (see http://www.tensorflow.org for how to install/upgrade)
OpenAI Gym (see http://gym.openai.com/docs)
NumPy (see http://www.numpy.org/)
SciPy (see http://www.scipy.org/)

Quick Start:

Run UREX on a simple environment:

python trainer.py --logtostderr --batch_size=400 --env=DuplicatedInput-v0 \
  --validation_frequency=25 --tau=0.1 --clip_norm=50 \
  --num_samples=10 --objective=urex

Run REINFORCE on a simple environment:

python trainer.py --logtostderr --batch_size=400 --env=DuplicatedInput-v0 \
  --validation_frequency=25 --tau=0.01 --clip_norm=50 \
  --num_samples=10 --objective=reinforce

Run PCL on a simple environment:

python trainer.py --logtostderr --batch_size=400 --env=DuplicatedInput-v0 \
  --validation_frequency=25 --tau=0.025 --rollout=10 --critic_weight=1.0 \
  --gamma=0.9 --clip_norm=10 --replay_buffer_freq=1 --objective=pcl

Run PCL with expert trajectories on a simple environment:

python trainer.py --logtostderr --batch_size=400 --env=DuplicatedInput-v0 \
  --validation_frequency=25 --tau=0.025 --rollout=10 --critic_weight=1.0 \
  --gamma=0.9 --clip_norm=10 --replay_buffer_freq=1 --objective=pcl \
  --num_expert_paths=10

Run Mujoco task with TRPO:

python trainer.py --logtostderr --batch_size=25 --env=HalfCheetah-v1 \
  --validation_frequency=5 --rollout=10 --gamma=0.995 \
  --max_step=1000 --cutoff_agent=1000 \
  --objective=trpo --norecurrent --internal_dim=64 --trust_region_p \
  --max_divergence=0.05 --value_opt=best_fit --critic_weight=0.0 \

Run Mujoco task with Trust-PCL:

python trainer.py --logtostderr --batch_size=1 --env=HalfCheetah-v1 \
  --validation_frequency=50 --rollout=10 --critic_weight=0.0 \
  --gamma=0.995 --clip_norm=40 --learning_rate=0.002 \
  --replay_buffer_freq=1 --replay_buffer_size=20000 \
  --replay_buffer_alpha=0.1 --norecurrent --objective=pcl \
  --max_step=100 --tau=0.0 --eviction=fifo --max_divergence=0.001 \
  --internal_dim=64 --cutoff_agent=1000 \
  --replay_batch_size=25 --nouse_online_batch --batch_by_steps \
  --sample_from=target --value_opt=grad --value_hidden_layers=2 \
  --update_eps_lambda --unify_episodes --clip_adv=1.0 \
  --target_network_lag=0.99 --prioritize_by=step

Run Mujoco task with PCL constraint trust region:

python trainer.py --logtostderr --batch_size=25 --env=HalfCheetah-v1 \
  --validation_frequency=5 --tau=0.001 --rollout=50 --gamma=0.99 \
  --max_step=1000 --cutoff_agent=1000 \
  --objective=pcl --norecurrent --internal_dim=64 --trust_region_p \
  --max_divergence=0.01 --value_opt=best_fit --critic_weight=0.0 \
  --tau_decay=0.1 --tau_start=0.1

Maintained by Ofir Nachum (ofirnachum).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pcl_rl

pcl_rl

README.md

Files

pcl_rl

Directory actions

More options

Directory actions

More options

Latest commit

History

pcl_rl

Folders and files

parent directory

README.md