
05‐07‐2024 Weekly Tag Up


Attendees

  • Chi Hui
  • Joe

Updates

  • Experiment 24 shows that our offline rollouts do not match the online rollouts
  • This probably indicates that we are not learning the correct value function

Next Steps

  • We need to verify two things (a minimal rollout-check sketch follows this list):

    • The online rollout routine is working properly
    • FQE (fitted Q evaluation) is working properly
  • Replace the learned value function (G) with the true value function (Q)

    • Obtained from Q-learning (or SARSA); a tabular sketch follows this list
      • This would be the "on policy" value function
      • We need to update our single-objective learning to include this kind of learning (we may already have a version implemented, but we haven't been using it)
    • Update the ablation study to accept a value function
      • Obtained from single-objective learning with an actor-critic method
      • This is the "off policy" value function
    • This is the method we will use
  • Could replace G with the reward function instead

    • This would mean adding a gamma decay term (a discounted-return sketch follows this list)
    • We are avoiding this because we haven't yet done enough research into how reward functions should be learned
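
As a starting point for the rollout check, here is a minimal sketch in Python. It assumes a Gymnasium-style environment and a policy object with an `act(state)` method (both hypothetical names, not our actual interfaces); it rolls the policy out online, averages the discounted Monte Carlo returns, and compares them against the offline FQE estimate. A large gap would point at either the online rollout routine or FQE.

```python
import numpy as np

def online_return(env, policy, gamma=0.99, max_steps=1000):
    """Run one live episode and return the discounted Monte Carlo return."""
    state, _ = env.reset()
    total, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = policy.act(state)  # hypothetical policy interface
        state, reward, terminated, truncated, _ = env.step(action)
        total += discount * reward
        discount *= gamma
        if terminated or truncated:
            break
    return total

def check_against_fqe(env, policy, fqe_estimate, n_episodes=100, gamma=0.99):
    """Average online returns and report the gap to the offline (FQE) estimate."""
    returns = [online_return(env, policy, gamma) for _ in range(n_episodes)]
    online_mean = float(np.mean(returns))
    print(f"online return: {online_mean:.3f} | FQE estimate: {fqe_estimate:.3f}")
    return online_mean - fqe_estimate
```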
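
For obtaining a "true" Q from Q-learning (or SARSA), a tabular sketch is below, assuming a small discrete Gymnasium-style environment; our actual environments and single-objective learning code may look quite different.

```python
import numpy as np

def tabular_q_learning(env, episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning on a discrete Gymnasium-style environment.
    Bootstrapping from the next action actually taken (Q[s2, a2]) instead of
    the max gives SARSA, the on-policy variant mentioned above."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy behaviour policy
            a = env.action_space.sample() if np.random.rand() < epsilon else int(np.argmax(Q[s]))
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            target = r + (0.0 if done else gamma * np.max(Q[s2]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q
```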
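
Finally, for the reward-function option, the gamma decay term would amount to folding per-step rewards into a discounted return. A minimal sketch:

```python
def discounted_return(rewards, gamma=0.99):
    """Fold a per-step reward sequence into G_0 = r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: rewards [1, 0, 2] with gamma = 0.9 give 1 + 0.9*0 + 0.81*2 = 2.62
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))
```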