
01‐30‐2024 Weekly Tag Up


Attendees

  • Joe
  • Chi-Hui

Agenda

  • Discuss latest experiment results
  • Changes to lambda update?
  • Empirical way to differentiate between objectives?

Updates

  • The mean policy is not really affected by the performance of the total Cg1 constraint
    • Regardless of the constraint ratio, the G1 returns are always around 2500 for the mean policy
  • Probably need to use mean policy for generating a new dataset
    • Use mean policy and excessive speed policy?
    • Would be more likely to see G1 returns below the threshold
      • Better G1 returns and worse G2 returns
      • Could use all three policies in a 30/30/30/10 split (the last 10% being random actions); see the mixing sketch after this list
  • A new dataset is generated AFTER all rounds, then another batch of rounds is executed
  • Reminder: our goal is to minimize the G2 cost while staying below a certain G1 constraint
    • Another idea would be to increase the percentage of actions from the excess speed policy in the initial dataset
      • 10% increase
      • How do we connect this to the results?
        • Connected to ratio? - probably not, they are doing different things
        • Perhaps if 30% of actions come from the excess speed policy, we should expect the mean policy to perform at approximately 30% of that policy's level?
      • We should probably address this AFTER we address the reward for the excess speed policy
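
A rough sketch of the 30/30/30/10 mix described above, assuming three trained policies exposing an act(state) method and a discrete action list; all names here are illustrative placeholders, not the project's actual code:

```python
import random

# Hypothetical 30/30/30/10 mix: queue-length, excess-speed, and mean policies,
# plus 10% uniformly random actions. Names and weights are placeholders.
POLICY_MIX = {"queue_length": 0.3, "excess_speed": 0.3, "mean": 0.3, "random": 0.1}

def sample_action(policies, state, action_space):
    """Pick a data source according to POLICY_MIX, then take an action from it."""
    names, weights = zip(*POLICY_MIX.items())
    source = random.choices(names, weights=weights, k=1)[0]
    if source == "random":
        return random.choice(action_space)
    return policies[source].act(state)
```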

Changing Excess Speed Reward

  • Currently linear (r = -penalty if penalty > 0, otherwise r = 0)
  • Could use r = -log(penalty) or r = -penalty^(0.5)
    • Start with r = -penalty^(0.5) (see the reward sketch after this list)
      • Avoids large returns when the penalty is very small
    • Physical meaning: the faster you go, the less each additional unit of speed matters; we care more about whether you are speeding at all
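
A minimal sketch of the reward shapes discussed above, assuming "penalty" is the scalar speeding quantity the current linear reward uses; function names are illustrative only:

```python
import math

def linear_reward(penalty):
    # Current version: linear in the penalty.
    return -penalty if penalty > 0 else 0.0

def sqrt_reward(penalty):
    # Proposed starting point: r = -penalty^(0.5); stays small when the penalty is small.
    return -math.sqrt(penalty) if penalty > 0 else 0.0

def log_reward(penalty):
    # Alternative: r = -log(penalty); magnitude blows up for very small penalties.
    return -math.log(penalty) if penalty > 0 else 0.0
```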

Next Steps

  • Solve single objective first
    • In the results, show learning curves for both g1 AND g2 (the policy is trained on g1, but we want to show the inverse relationship between g1 and g2)
  • Then run the same batch offline policy learning using a ratio of -0.25
  • Assess next steps after reviewing results

Empirically relating performance metrics

  • Learning curves for single-objective policies should show returns for both metrics (see the plotting sketch after this list)
    • I.e., the queue length policy should show both g1 and g2 returns
    • They should "cross" on the learning curve (as g2 improves, g1 gets worse)
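
A minimal matplotlib sketch of the requested plot, logging both returns per training round so the crossing is visible; variable and file names are assumptions:

```python
import matplotlib.pyplot as plt

def plot_learning_curves(rounds, g1_returns, g2_returns, path="learning_curves.png"):
    """Plot g1 and g2 returns per training round on the same axes."""
    fig, ax = plt.subplots()
    ax.plot(rounds, g1_returns, label="g1 (queue length) return")
    ax.plot(rounds, g2_returns, label="g2 (excess speed) return")
    ax.set_xlabel("Training round")
    ax.set_ylabel("Return")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)
```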

Updates to lambda learning

  • Use g1 & g2 return rates instead of total returns for the update (see the sketch after this list)
    • Implement this after reviewing the results of the single-objective changes
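
A hedged sketch of the proposed update, assuming a Lagrangian-style multiplier on the G1 constraint; the threshold, learning rate, and names are placeholders, and the point is only that a per-step return rate replaces the total return in the update:

```python
def update_lambda(lam, g1_returns, episode_lengths, g1_threshold_rate, lr=0.01):
    # Return *rate*: average per-step g1 return over the batch, instead of the total return.
    g1_rate = sum(g1_returns) / sum(episode_lengths)
    # Raise lambda when the g1 rate exceeds the constraint threshold, lower it otherwise.
    lam += lr * (g1_rate - g1_threshold_rate)
    return max(lam, 0.0)  # keep the multiplier non-negative
```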