
Masks for LstmPolicy in PPO1 #60

Closed
hejujie opened this issue Oct 11, 2018 · 5 comments
Labels
bug Something isn't working

Comments

hejujie commented Oct 11, 2018

Step function in LstmPolicy is called without masks
I am using PPO1 with an LstmPolicy in a gym-based environment. After the model is set up in pposgd_simple.py, trpo_mpi.utils.traj_segment_generator is called from the learn function, and inside traj_segment_generator() LstmPolicy.step() is called without a mask, even though LstmPolicy.step() needs a mask to be fed; the error occurs there.
I also found that step() is called by a2c.py, where it gets the mask from its runner(), so I am trying to write code that follows a2c.py. However, I would like to know whether there is an easier way to fix this.
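
For reference, the a2c-style pattern I mean is roughly the following (my own sketch of the idea, not the actual runner code; policy, obs, states, prev_dones stand in for the real objects):

# Rough sketch of the a2c-style idea (illustrative only, not library code):
# keep the done flags from the previous step and feed them to policy.step()
# as the LSTM reset mask, so masks_ph always receives a value of shape (n_batch,).
import numpy as np

def rollout_step(policy, obs, states, prev_dones):
    """One rollout step: feed the previous done flags as the LSTM reset mask."""
    masks = np.asarray(prev_dones, dtype=np.float32)  # shape (n_envs,), matches masks_ph when n_steps == 1
    actions, values, new_states, _ = policy.step(obs, states, masks)
    return actions, values, new_states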

Related code

  • Train function:
def train(env, args):

    env.reset()
    model = PPO1(
        LstmPolicy,
        env,
        timesteps_per_actorbatch=int(args.actor_batch),
        clip_param=0.2,
        entcoeff=0.01,
        optim_epochs=5,
        optim_stepsize=args.learning_rate,
        optim_batchsize=int(args.optim_batch),
        gamma=0.99,
        lam=0.95,
        schedule='linear',
    )
    model.learn(
        total_timesteps=int(args.num_timesteps),
    )
    model.save(args.save_filename)
    return model
  • In ppo1.pposgd_simple learn()
from stable_baselines.trpo_mpi.utils import traj_segment_generator, add_vtarg_and_adv, flatten_lists

    def learn(self, total_timesteps, callback=None, seed=None, log_interval=100, tb_log_name="PPO1"):
        with SetVerbosity(self.verbose), TensorboardWriter(self.graph, self.tensorboard_log, tb_log_name) as writer:
            self._setup_learn(seed)

            assert issubclass(self.policy, ActorCriticPolicy), "Error: the input policy for the PPO1 model must be " \
                                                               "an instance of common.policies.ActorCriticPolicy."

            with self.sess.as_default():
                self.adam.sync()

                # Prepare for rollouts
                seg_gen = traj_segment_generator(self.policy_pi, self.env, self.timesteps_per_actorbatch)
  • In trpo_mpi.utils traj_segment_generator():
    while True:
        prevac = action
        action, vpred, states, _ = policy.step(observation.reshape(-1, *observation.shape), states, done)
        # Slight weirdness here because we need value function at time T
        # before returning segment [0, T-1] so we get the correct
        # terminal value
        if step > 0 and step % horizon == 0:
            # Fix to avoid "mean of empty slice" warning when there is only one episode
            if len(ep_rets) == 0:
                ep_rets = [cur_ep_ret]
                ep_lens = [cur_ep_len]
                ep_true_rets = [cur_ep_true_ret]
                total_timesteps = cur_ep_len
            else:
                total_timesteps = sum(ep_lens) + cur_ep_len
  • In common.policies LstmPolicy.step()
    def step(self, obs, state=None, mask=None, deterministic=False):
        if deterministic:
            return self.sess.run([self.deterministic_action, self._value, self.snew, self.neglogp],
                                 {self.obs_ph: obs, self.states_ph: state, self.masks_ph: mask})
        else:
            return self.sess.run([self.action, self._value, self.snew, self.neglogp],
                                 {self.obs_ph: obs, self.states_ph: state, self.masks_ph: mask})

Error information

Traceback (most recent call last):
  File "/opt/conda/envs/tensorflow-py36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/envs/tensorflow-py36/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/workspace/rl/action_detection_reinforcement_learning/main/openai_stable_baseline_ppo1.py", line 142, in <module>
    main()
  File "/workspace/rl/action_detection_reinforcement_learning/main/openai_stable_baseline_ppo1.py", line 136, in main
    train(env, args)
  File "/workspace/rl/action_detection_reinforcement_learning/main/openai_stable_baseline_ppo1.py", line 74, in train
    total_timesteps=int(args.num_timesteps),
  File "/workspace/rl/stable-baselines/stable_baselines/ppo1/pposgd_simple.py", line 215, in learn
    seg = seg_gen.__next__()
  File "/workspace/rl/stable-baselines/stable_baselines/trpo_mpi/utils.py", line 58, in traj_segment_generator
    action, vpred, states, _ = policy.step(observation.reshape(-1, *observation.shape), states, done)
  File "/workspace/rl/stable-baselines/stable_baselines/common/policies.py", line 226, in step
    {self.obs_ph: obs, self.states_ph: state, self.masks_ph: mask})
  File "/opt/conda/envs/tensorflow-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/opt/conda/envs/tensorflow-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1111, in _run
    str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape () for Tensor 'input/masks_ph:0', which has shape '(1,)'
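
The shape mismatch itself can be reproduced outside of stable-baselines; here is a minimal standalone sketch in plain TensorFlow 1.x (the placeholder name just mirrors masks_ph):

# Standalone reproduction of the ValueError above (plain TF 1.x, not stable-baselines code).
# masks_ph is declared with shape [1], but the raw Python bool `done` has shape (),
# so feeding it directly triggers "Cannot feed value of shape () ... which has shape '(1,)'".
import tensorflow as tf

masks_ph = tf.placeholder(tf.float32, [1], name="masks_ph")
doubled = masks_ph * 2.0

with tf.Session() as sess:
    done = False
    # sess.run(doubled, {masks_ph: done})                # raises the ValueError above
    print(sess.run(doubled, {masks_ph: [float(done)]}))  # a length-1 list matches shape (1,)
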
@brendenpetersen

I've encountered this same issue. I also mentioned it in #35, but @hill-a hasn't commented on it yet. Also note that the error message differed for me between TensorFlow 1.8 and 1.12.

Same issue for TRPO, by the way. I haven't yet tried others.

araffin added the bug label Oct 11, 2018
araffin (Collaborator) commented Oct 11, 2018

Hi,
Thanks for reporting the issue.
Looking at the code, it seems that nothing is done for recurrent policies, so they may not be supported yet for PPO1 and TRPO (both use MPI for multiprocessing). In that case, the documentation should be updated anyway. I'm waiting for @hill-a's answer too.

hejujie (Author) commented Oct 12, 2018

@brendenpetersen, I also encountered the same problem when setup_model is called in pposgd_simple.py; I think it is also a bug. When self.policy() is called there, n_envs=self.n_envs and n_steps=1, while n_batch is None.
So when BasePolicy is constructed, one dimension of the masks placeholder is set to None (it comes from n_batch), and when that placeholder is later used, the error reports that its shape is not fully defined. I fixed it by calling self.policy() with n_batch=n_steps*self.n_envs.

def setup_model(self):
        with SetVerbosity(self.verbose):

            self.graph = tf.Graph()
            with self.graph.as_default():
                self.sess = tf_util.single_threaded_session(graph=self.graph)

                # Construct network for new policy
                self.policy_pi = self.policy(self.sess, self.observation_space, self.action_space, self.n_envs, 1,
                                             None, reuse=False)

                # Network for old policy
                with tf.variable_scope("oldpi", reuse=False):
                    old_pi = self.policy(self.sess, self.observation_space, self.action_space, self.n_envs, 1,
                                         None, reuse=False)

In common.policies, the placeholders are defined as:

        with tf.variable_scope("input", reuse=False):
            if obs_phs is None:
                self.obs_ph, self.processed_x = observation_input(ob_space, n_batch, scale=scale)
            else:
                self.obs_ph, self.processed_x = obs_phs
            self.masks_ph = tf.placeholder(tf.float32, [n_batch], name="masks_ph")  # mask (done t-1)
            self.states_ph = tf.placeholder(tf.float32, [self.n_env, n_lstm * 2], name="states_ph")  # states
            self.action_ph = None
            if add_action_ph:
                self.action_ph = tf.placeholder(dtype=ac_space.dtype, shape=(None,) + ac_space.shape, name="action_ph")

Fix for the "not fully defined" bug:

    def setup_model(self):
        with SetVerbosity(self.verbose):

            self.graph = tf.Graph()
            with self.graph.as_default():
                self.sess = tf_util.single_threaded_session(graph=self.graph)

                # Construct network for new policy
                # changed: n_batch was None, now self.n_envs*1 (= n_envs * n_steps)
                self.policy_pi = self.policy(self.sess, self.observation_space, self.action_space, self.n_envs, 1,
                                             self.n_envs*1, reuse=False)

                # Network for old policy
                with tf.variable_scope("oldpi", reuse=False):
                    old_pi = self.policy(self.sess, self.observation_space, self.action_space, self.n_envs, 1,
                                         self.n_envs*1, reuse=False)
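
With that change, n_batch = n_envs * n_steps = 1 for PPO1's single-environment rollout, so the mask placeholder is created with a concrete shape. A tiny sketch of the effect (plain TF 1.x; n_envs = n_steps = 1 assumed for PPO1):

# Effect of passing n_batch = n_envs * n_steps instead of None (plain TF 1.x sketch).
import tensorflow as tf

n_envs, n_steps = 1, 1
n_batch = n_envs * n_steps
masks_ph = tf.placeholder(tf.float32, [n_batch], name="masks_ph")
print(masks_ph.get_shape())  # (1,) -- fully defined, instead of (?,) when n_batch is None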

brendenpetersen commented Oct 12, 2018

@hejujie That doesn't work for me. To clarify, the only changes you made were changing None to self.n_envs*1 in 2 places?

I get this error:

Traceback (most recent call last):
  File "launcher.py", line 72, in <module>
    main()
  File "launcher.py", line 68, in main
    drl.learn(total_timesteps=1e99, callback=callback)
  File "/Users/petersen33/repositories/stable-baselines/stable_baselines/ppo1/pposgd_simple.py", line 215, in learn
    seg = seg_gen.__next__()
  File "/Users/petersen33/repositories/stable-baselines/stable_baselines/trpo_mpi/utils.py", line 58, in traj_segment_generator
    action, vpred, states, _ = policy.step(observation.reshape(-1, *observation.shape), states, done)
  File "/Users/petersen33/repositories/stable-baselines/stable_baselines/common/policies.py", line 219, in step
    {self.obs_ph: obs, self.states_ph: state, self.masks_ph: mask})
  File "/Users/petersen33/repositories/venv3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 887, in run
    run_metadata_ptr)
  File "/Users/petersen33/repositories/venv3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1086, in _run
    str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape () for Tensor 'input/masks_ph:0', which has shape '(1,)'

EDIT: If I wrap done in trpo_mpi/utils.py line 58 as [done], it gets past that error, but my very first action becomes a vector of NaN.
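
For clarity, the change I tried is just this (a sketch of my local edit to the policy.step call quoted above, not a proper fix):

# Local workaround (sketch): wrap the scalar done flag in a list so it matches
# masks_ph's shape (1,). It gets past the ValueError, but the first action then
# comes back as a vector of NaN, so this is not a real fix.
action, vpred, states, _ = policy.step(observation.reshape(-1, *observation.shape), states, [done])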

araffin (Collaborator) commented Jun 2, 2019

Closing this issue in favor of #140.

araffin closed this as completed Jun 2, 2019