Hi,
I have a question regarding the implementation of the advantage calculation. The code snippet is as follows:
dreamerv2/dreamerv2/agent.py, lines 252 to 274 at commit 07d906e:
advantage = tf.stop_gradient(target[1:] - self._target_critic(seq['feat'][:-2]).mode())
Based on my understanding:

- `seq['feat']` contains time steps from 0 to horizon.
- `target` contains time steps from 0 to horizon-1, since the value at the last step is only used as a bootstrap for `lambda_return`.
- `baseline` in Line 271 includes time steps from 0 to horizon-2, and `target[1:]` includes time steps from 1 to horizon-1.
If I understand correctly, the code uses $V_{t+1}^{\lambda} - v_\xi\left(\hat{z}_t\right)$ to calculate the advantage,
rather than $V_t^{\lambda} - v_{\xi}(\hat{z}_t)$ as stated in the paper. Is that intended?
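For concreteness, here is a minimal index-alignment sketch of how I read the slicing. It uses dummy NumPy arrays and a simplified lambda-return recursion; the helper function, shapes, and variable names are my own illustration and not the repository's actual code:

```python
# Hypothetical index-alignment sketch; simplified lambda-return, not the repo's code.
import numpy as np

horizon = 5                                 # imagined rollout length (assumption)
T = horizon + 1                             # seq['feat'] holds steps 0 .. horizon

reward = np.arange(T, dtype=np.float32)     # dummy reward per step
value = np.arange(T, dtype=np.float32)      # dummy critic values v(z_0) .. v(z_horizon)
disc = np.full(T, 0.99, dtype=np.float32)   # dummy discount factors
lam = 0.95

def lambda_return(reward, value, disc, bootstrap, lam):
    """Simplified recursion: V_t = r_t + disc_t * ((1 - lam) * v_{t+1} + lam * V_{t+1})."""
    next_values = np.concatenate([value[1:], [bootstrap]])
    returns = np.zeros_like(reward)
    last = bootstrap
    for t in reversed(range(len(reward))):
        returns[t] = reward[t] + disc[t] * ((1 - lam) * next_values[t] + lam * last)
        last = returns[t]
    return returns

# target covers steps 0 .. horizon-1; the final value only enters as the bootstrap.
target = lambda_return(reward[:-1], value[:-1], disc[:-1], bootstrap=value[-1], lam=lam)

# baseline = critic on seq['feat'][:-2], i.e. steps 0 .. horizon-2.
baseline = value[:-2]

# target[1:] covers steps 1 .. horizon-1, so the subtraction pairs V^lambda_{t+1}
# with v(z_t) rather than V^lambda_t with v(z_t).
advantage = target[1:] - baseline

print(target.shape, baseline.shape, advantage.shape)  # (5,) (4,) (4,)
```

With this reading, the one-step offset between `target[1:]` and `baseline` is what produces the $t+1$ versus $t$ pairing I describe above.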
I am very impressed with the Dreamer series of reinforcement learning algorithms. Thank you for your hard work!