
Masks for LstmPolicy in PPO1 #60

Closed
hejujie opened this issue Oct 11, 2018 · 5 comments
Labels
bug Something isn't working

Comments

hejujie commented Oct 11, 2018

Step function in LstmPolicy is called without masks
I am using PPO1 with an LstmPolicy in a gym-based environment. After the model is set up in pposgd_simple.py, trpo_mpi.utils.traj_segment_generator is called from the learn function, and inside traj_segment_generator() LstmPolicy.step() is called without a mask, even though LstmPolicy.step() needs a mask to be fed; the error occurs there.
I also found that step() is called by a2c.py, where it gets the mask from its runner(), so I am trying to write code that follows a2c.py. However, I would like to know whether there is an easier way to fix this.
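
For reference, the a2c-style pattern I mean is roughly the following (my own sketch of the idea, not the actual runner code; policy, obs, states, prev_dones stand in for the real objects):

# Rough sketch of the a2c-style idea (illustrative only, not library code):
# keep the done flags from the previous step and feed them to policy.step()
# as the LSTM reset mask, so masks_ph always receives a value of shape (n_batch,).
import numpy as np

def rollout_step(policy, obs, states, prev_dones):
    """One rollout step: feed the previous done flags as the LSTM reset mask."""
    masks = np.asarray(prev_dones, dtype=np.float32)  # shape (n_envs,), matches masks_ph when n_steps == 1
    actions, values, new_states, _ = policy.step(obs, states, masks)
    return actions, values, new_states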

Related code

  • Train function:
def train(env, args):

    env.reset()
    model = PPO1(
        LstmPolicy,
        env,
        timesteps_per_actorbatch=int(args.actor_batch),
        clip_param=0.2,
        entcoeff=0.01,
        optim_epochs=5,
        optim_stepsize=args.learning_rate,
        optim_batchsize=int(args.optim_batch),
        gamma=0.99,
        lam=0.95,
        schedule='linear',
    )
    model.learn(
        total_timesteps=int(args.num_timesteps),
    )
    model.save(args.save_filename)
    return model
  • In ppo1.pposgd_simple learn()
from stable_baselines.trpo_mpi.utils import traj_segment_generator, add_vtarg_and_adv, flatten_lists

    def learn(self, total_timesteps, callback=None, seed=None, log_interval=100, tb_log_name="PPO1"):
        with SetVerbosity(self.verbose), TensorboardWriter(self.graph, self.tensorboard_log, tb_log_name) as writer:
            self._setup_learn(seed)

            assert issubclass(self.policy, ActorCriticPolicy), "Error: the input policy for the PPO1 model must be " \
                                                               "an instance of common.policies.ActorCriticPolicy."

            with self.sess.as_default():
                self.adam.sync()

                # Prepare for rollouts
                seg_gen = traj_segment_generator(self.policy_pi, self.env, self.timesteps_per_actorbatch)
  • In trpo_mpi.utils traj_segment_generator():
    while True:
        prevac = action
        action, vpred, states, _ = policy.step(observation.reshape(-1, *observation.shape), states, done)
        # Slight weirdness here because we need value function at time T
        # before returning segment [0, T-1] so we get the correct
        # terminal value
        if step > 0 and step % horizon == 0:
            # Fix to avoid "mean of empty slice" warning when there is only one episode
            if len(ep_rets) == 0:
                ep_rets = [cur_ep_ret]
                ep_lens = [cur_ep_len]
                ep_true_rets = [cur_ep_true_ret]
                total_timesteps = cur_ep_len
            else:
                total_timesteps = sum(ep_lens) + cur_ep_len
  • In common.policies LstmPolicy.step()
    def step(self, obs, state=None, mask=None, deterministic=False):
        if deterministic:
            return self.sess.run([self.deterministic_action, self._value, self.snew, self.neglogp],
                                 {self.obs_ph: obs, self.states_ph: state, self.masks_ph: mask})
        else:
            return self.sess.run([self.action, self._value, self.snew, self.neglogp],
                                 {self.obs_ph: obs, self.states_ph: state, self.masks_ph: mask})

Error information

Traceback (most recent call last):
  File "/opt/conda/envs/tensorflow-py36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/envs/tensorflow-py36/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/workspace/rl/action_detection_reinforcement_learning/main/openai_stable_baseline_ppo1.py", line 142, in <module>
    main()
  File "/workspace/rl/action_detection_reinforcement_learning/main/openai_stable_baseline_ppo1.py", line 136, in main
    train(env, args)
  File "/workspace/rl/action_detection_reinforcement_learning/main/openai_stable_baseline_ppo1.py", line 74, in train
    total_timesteps=int(args.num_timesteps),
  File "/workspace/rl/stable-baselines/stable_baselines/ppo1/pposgd_simple.py", line 215, in learn
    seg = seg_gen.__next__()
  File "/workspace/rl/stable-baselines/stable_baselines/trpo_mpi/utils.py", line 58, in traj_segment_generator
    action, vpred, states, _ = policy.step(observation.reshape(-1, *observation.shape), states, done)
  File "/workspace/rl/stable-baselines/stable_baselines/common/policies.py", line 226, in step
    {self.obs_ph: obs, self.states_ph: state, self.masks_ph: mask})
  File "/opt/conda/envs/tensorflow-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/opt/conda/envs/tensorflow-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1111, in _run
    str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape () for Tensor 'input/masks_ph:0', which has shape '(1,)'
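
The shape mismatch itself can be reproduced outside of stable-baselines; here is a minimal standalone sketch in plain TensorFlow 1.x (the placeholder name just mirrors masks_ph):

# Standalone reproduction of the ValueError above (plain TF 1.x, not stable-baselines code).
# masks_ph is declared with shape [1], but the raw Python bool `done` has shape (),
# so feeding it directly triggers "Cannot feed value of shape () ... which has shape '(1,)'".
import tensorflow as tf

masks_ph = tf.placeholder(tf.float32, [1], name="masks_ph")
doubled = masks_ph * 2.0

with tf.Session() as sess:
    done = False
    # sess.run(doubled, {masks_ph: done})                # raises the ValueError above
    print(sess.run(doubled, {masks_ph: [float(done)]}))  # a length-1 list matches shape (1,)
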
@brendenpetersen

I've encountered this same issue. I also mentioned it in #35, but @hill-a hasn't commented on it yet. Also note that the error message differed for me between TensorFlow 1.8 and 1.12.

Same issue for TRPO, by the way. I haven't yet tried others.

araffin added the bug label Oct 11, 2018
araffin (Collaborator) commented Oct 11, 2018

Hi,
Thanks for reporting the issue.
Looking at the code, it seems that nothing is done for recurrent policies, so they may not be supported yet for PPO1 and TRPO (both use MPI for multiprocessing). In that case, the documentation should be updated anyway. I'm waiting for @hill-a's answer too.

hejujie (Author) commented Oct 12, 2018

@brendenpetersen, I also encountered the same problem when setup_model is called in pposgd_simple.py; I think it is also a bug. When self.policy() is called there, n_envs=self.n_envs and n_steps=1, while n_batch is None.
So when BasePolicy is constructed, one dimension of the masks placeholder is set to None (it comes from n_batch), and when that placeholder is later used, the error reports that its shape is not fully defined. I fixed it by calling self.policy() with n_batch=n_steps*self.n_envs.

def setup_model(self):
        with SetVerbosity(self.verbose):

            self.graph = tf.Graph()
            with self.graph.as_default():
                self.sess = tf_util.single_threaded_session(graph=self.graph)

                # Construct network for new policy
                self.policy_pi = self.policy(self.sess, self.observation_space, self.action_space, self.n_envs, 1,
                                             None, reuse=False)

                # Network for old policy
                with tf.variable_scope("oldpi", reuse=False):
                    old_pi = self.policy(self.sess, self.observation_space, self.action_space, self.n_envs, 1,
                                         None, reuse=False)

In common.policies, the placeholders are defined as:

        with tf.variable_scope("input", reuse=False):
            if obs_phs is None:
                self.obs_ph, self.processed_x = observation_input(ob_space, n_batch, scale=scale)
            else:
                self.obs_ph, self.processed_x = obs_phs
            self.masks_ph = tf.placeholder(tf.float32, [n_batch], name="masks_ph")  # mask (done t-1)
            self.states_ph = tf.placeholder(tf.float32, [self.n_env, n_lstm * 2], name="states_ph")  # states
            self.action_ph = None
            if add_action_ph:
                self.action_ph = tf.placeholder(dtype=ac_space.dtype, shape=(None,) + ac_space.shape, name="action_ph")

Fix for the "not fully defined" bug:

    def setup_model(self):
        with SetVerbosity(self.verbose):

            self.graph = tf.Graph()
            with self.graph.as_default():
                self.sess = tf_util.single_threaded_session(graph=self.graph)

                # Construct network for new policy
                # changed: n_batch was None, now self.n_envs*1 (= n_envs * n_steps)
                self.policy_pi = self.policy(self.sess, self.observation_space, self.action_space, self.n_envs, 1,
                                             self.n_envs*1, reuse=False)

                # Network for old policy
                with tf.variable_scope("oldpi", reuse=False):
                    old_pi = self.policy(self.sess, self.observation_space, self.action_space, self.n_envs, 1,
                                         self.n_envs*1, reuse=False)
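
With that change, n_batch = n_envs * n_steps = 1 for PPO1's single-environment rollout, so the mask placeholder is created with a concrete shape. A tiny sketch of the effect (plain TF 1.x; n_envs = n_steps = 1 assumed for PPO1):

# Effect of passing n_batch = n_envs * n_steps instead of None (plain TF 1.x sketch).
import tensorflow as tf

n_envs, n_steps = 1, 1
n_batch = n_envs * n_steps
masks_ph = tf.placeholder(tf.float32, [n_batch], name="masks_ph")
print(masks_ph.get_shape())  # (1,) -- fully defined, instead of (?,) when n_batch is None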

brendenpetersen commented Oct 12, 2018

@hejujie That doesn't work for me. To clarify, the only changes you made were changing None to self.n_envs*1 in 2 places?

I get this error:

Traceback (most recent call last):
  File "launcher.py", line 72, in <module>
    main()
  File "launcher.py", line 68, in main
    drl.learn(total_timesteps=1e99, callback=callback)
  File "/Users/petersen33/repositories/stable-baselines/stable_baselines/ppo1/pposgd_simple.py", line 215, in learn
    seg = seg_gen.__next__()
  File "/Users/petersen33/repositories/stable-baselines/stable_baselines/trpo_mpi/utils.py", line 58, in traj_segment_generator
    action, vpred, states, _ = policy.step(observation.reshape(-1, *observation.shape), states, done)
  File "/Users/petersen33/repositories/stable-baselines/stable_baselines/common/policies.py", line 219, in step
    {self.obs_ph: obs, self.states_ph: state, self.masks_ph: mask})
  File "/Users/petersen33/repositories/venv3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 887, in run
    run_metadata_ptr)
  File "/Users/petersen33/repositories/venv3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1086, in _run
    str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape () for Tensor 'input/masks_ph:0', which has shape '(1,)'

EDIT: If I wrap done in trpo_mpi/utils.py line 58 as [done], it gets past that error, but my very first action becomes a vector of NaN.
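
For clarity, the change I tried is just this (a sketch of my local edit to the policy.step call quoted above, not a proper fix):

# Local workaround (sketch): wrap the scalar done flag in a list so it matches
# masks_ph's shape (1,). It gets past the ValueError, but the first action then
# comes back as a vector of NaN, so this is not a real fix.
action, vpred, states, _ = policy.step(observation.reshape(-1, *observation.shape), states, [done])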

araffin (Collaborator) commented Jun 2, 2019

Closing this issue in favor of #140.

araffin closed this as completed Jun 2, 2019