Hi,
I have a question regarding the implementation of the advantage calculation. The code snippet is as follows:
dreamerv2/dreamerv2/agent.py, lines 252 to 274 at commit 07d906e:
advantage = tf.stop_gradient(target[1:] - self._target_critic(seq['feat'][:-2]).mode())
Based on my understanding:

- `seq['feat']` contains time steps from 0 to horizon.
- `target` contains time steps from 0 to horizon-1, since the value at the last step is only used as a bootstrap for `lambda_return`.
- `baseline` in Line 271 includes time steps from 0 to horizon-2, and `target[1:]` includes time steps from 1 to horizon-1.
If I understand correctly, the code uses $V_{t+1}^{\lambda} - v_\xi\left(\hat{z}_t\right)$ to calculate the advantage,
rather than $V_t^{\lambda} - v_{\xi}(\hat{z}_t)$ as stated in the paper. Is that intended?
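For concreteness, here is a minimal index-alignment sketch of how I read the slicing. It uses dummy NumPy arrays and a simplified lambda-return recursion; the helper function, shapes, and variable names are my own illustration and not the repository's actual code:

```python
# Hypothetical index-alignment sketch; simplified lambda-return, not the repo's code.
import numpy as np

horizon = 5                                 # imagined rollout length (assumption)
T = horizon + 1                             # seq['feat'] holds steps 0 .. horizon

reward = np.arange(T, dtype=np.float32)     # dummy reward per step
value = np.arange(T, dtype=np.float32)      # dummy critic values v(z_0) .. v(z_horizon)
disc = np.full(T, 0.99, dtype=np.float32)   # dummy discount factors
lam = 0.95

def lambda_return(reward, value, disc, bootstrap, lam):
    """Simplified recursion: V_t = r_t + disc_t * ((1 - lam) * v_{t+1} + lam * V_{t+1})."""
    next_values = np.concatenate([value[1:], [bootstrap]])
    returns = np.zeros_like(reward)
    last = bootstrap
    for t in reversed(range(len(reward))):
        returns[t] = reward[t] + disc[t] * ((1 - lam) * next_values[t] + lam * last)
        last = returns[t]
    return returns

# target covers steps 0 .. horizon-1; the final value only enters as the bootstrap.
target = lambda_return(reward[:-1], value[:-1], disc[:-1], bootstrap=value[-1], lam=lam)

# baseline = critic on seq['feat'][:-2], i.e. steps 0 .. horizon-2.
baseline = value[:-2]

# target[1:] covers steps 1 .. horizon-1, so the subtraction pairs V^lambda_{t+1}
# with v(z_t) rather than V^lambda_t with v(z_t).
advantage = target[1:] - baseline

print(target.shape, baseline.shape, advantage.shape)  # (5,) (4,) (4,)
```

With this reading, the one-step offset between `target[1:]` and `baseline` is what produces the $t+1$ versus $t$ pairing I describe above.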
I am very impressed with the Dreamer series of reinforcement learning algorithms. Thank you for your hard work!