05‐07‐2024 Weekly Tag Up
- Chi Hui
- Joe
- Experiment 24 is showing that our offline rollouts do not match the online rollouts
  - Probably an indication that we are not learning the correct value function
- We need to verify 2 things (see the sketch after this list):
  - The online rollout routine is working properly
  - FQE (Fitted Q Evaluation) is working properly
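A minimal sketch of this check, assuming a Gym-style `env`, a `policy(state) -> action` function, and a hypothetical `fqe_q(state, action)` estimator produced by FQE (none of these names are from the experiment code):

```python
import numpy as np

def online_return(env, policy, gamma=0.99, n_episodes=100):
    """Monte Carlo estimate of the policy's value from online rollouts."""
    returns = []
    for _ in range(n_episodes):
        state, _ = env.reset()
        done, g, discount = False, 0.0, 1.0
        while not done:
            action = policy(state)
            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            g += discount * reward
            discount *= gamma
        returns.append(g)
    return np.mean(returns)

def fqe_value(env, policy, fqe_q, n_starts=100):
    """FQE's value estimate at initial states: Q(s0, pi(s0))."""
    starts = [env.reset()[0] for _ in range(n_starts)]
    return np.mean([fqe_q(s, policy(s)) for s in starts])

# If FQE has learned the correct value function, these should roughly agree:
# print(online_return(env, policy), fqe_value(env, policy, fqe_q))
```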
- Replace the learned value function (G) with the true value function (Q); see the SARSA sketch after this list
  - Obtained from Q-learning (or SARSA)
    - This would be the "on-policy" value function
    - We need to update our single-objective learning to include this kind of learning (we may have a version of it implemented, but we haven't been using it)
  - Update the ablation study to accept a value function
    - Obtained from single-objective learning with the actor-critic method
      - This is the "off-policy" value function
      - This is the method we will use
    - Obtained from Q-learning (or SARSA)
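One way to obtain that "true" value function is SARSA-style TD evaluation of the fixed policy. A minimal tabular sketch, assuming discrete states/actions and the same hypothetical `env`/`policy` names as above:

```python
import numpy as np

def sarsa_q(env, policy, n_states, n_actions,
            gamma=0.99, alpha=0.1, n_episodes=5000):
    """On-policy TD evaluation of a fixed policy: learns Q^pi from
    online experience, one (s, a, r, s', a') transition at a time."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        state, _ = env.reset()
        action = policy(state)
        done = False
        while not done:
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            next_action = policy(next_state)
            target = reward if done else reward + gamma * Q[next_state, next_action]
            Q[state, action] += alpha * (target - Q[state, action])
            state, action = next_state, next_action
    return Q
```

The resulting Q table could then be passed into the ablation study in place of the learned G.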
- Could replace G with the reward function (see the discounted-return sketch below)
  - Adding a gamma decay (discount) term
  - We are avoiding this because we haven't done enough research into how reward functions should be learned
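For reference, replacing G with the reward function plus a gamma decay term amounts to using the empirical discounted return. A minimal sketch, assuming a list of per-step rewards from a single rollout:

```python
def discounted_return(rewards, gamma=0.99):
    """G = r_0 + gamma * r_1 + gamma^2 * r_2 + ... (computed backwards)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```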