05‐07‐2024 Weekly Tag Up
- Chi Hui
- Joe
- Experiment 24 is showing that our offline rollouts do not match the online rollouts
  - Probably an indication that we are not learning the correct value function
- We need to verify 2 things (see the sketch after this list):
  - The online rollout routine is working properly
  - FQE (Fitted Q Evaluation) is working properly
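A minimal sketch of this check, assuming a Gym-style `env`, a `policy(state) -> action` function, and a hypothetical `fqe_q(state, action)` estimator produced by FQE (none of these names are from the experiment code):

```python
import numpy as np

def online_return(env, policy, gamma=0.99, n_episodes=100):
    """Monte Carlo estimate of the policy's value from online rollouts."""
    returns = []
    for _ in range(n_episodes):
        state, _ = env.reset()
        done, g, discount = False, 0.0, 1.0
        while not done:
            action = policy(state)
            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            g += discount * reward
            discount *= gamma
        returns.append(g)
    return np.mean(returns)

def fqe_value(env, policy, fqe_q, n_starts=100):
    """FQE's value estimate at initial states: Q(s0, pi(s0))."""
    starts = [env.reset()[0] for _ in range(n_starts)]
    return np.mean([fqe_q(s, policy(s)) for s in starts])

# If FQE has learned the correct value function, these should roughly agree:
# print(online_return(env, policy), fqe_value(env, policy, fqe_q))
```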
- Replace the learned value function (G) with the true value function (Q); see the SARSA sketch after this list
  - Obtained from Q-learning (or SARSA)
    - This would be the "on-policy" value function
    - We need to update our single-objective learning to include this kind of learning (we may have a version of it implemented, but we haven't been using it)
  - Update the ablation study to accept a value function
    - Obtained from single-objective learning with the actor-critic method
      - This is the "off-policy" value function
      - This is the method we will use
    - Obtained from Q-learning (or SARSA)
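One way to obtain that "true" value function is SARSA-style TD evaluation of the fixed policy. A minimal tabular sketch, assuming discrete states/actions and the same hypothetical `env`/`policy` names as above:

```python
import numpy as np

def sarsa_q(env, policy, n_states, n_actions,
            gamma=0.99, alpha=0.1, n_episodes=5000):
    """On-policy TD evaluation of a fixed policy: learns Q^pi from
    online experience, one (s, a, r, s', a') transition at a time."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        state, _ = env.reset()
        action = policy(state)
        done = False
        while not done:
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            next_action = policy(next_state)
            target = reward if done else reward + gamma * Q[next_state, next_action]
            Q[state, action] += alpha * (target - Q[state, action])
            state, action = next_state, next_action
    return Q
```

The resulting Q table could then be passed into the ablation study in place of the learned G.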
- Could replace G with the reward function (see the discounted-return sketch below)
  - Adding a gamma decay (discount) term
  - We are avoiding this because we haven't done enough research into how reward functions should be learned
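For reference, replacing G with the reward function plus a gamma decay term amounts to using the empirical discounted return. A minimal sketch, assuming a list of per-step rewards from a single rollout:

```python
def discounted_return(rewards, gamma=0.99):
    """G = r_0 + gamma * r_1 + gamma^2 * r_2 + ... (computed backwards)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```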