
04‐12‐2024 Weekly Tag Up


Attendees

  • Chi Hui
  • Joe

Updates

  • Joe spent some time re-analyzing the single objective policies
  • In most cases, the speed overage policy performed worse than the queue policy when evaluated against the g1 metric
    • Meaning that if we're trying to learn to obey the g1 constraint, the speed overage policy wasn't actually helping us
    • The exception to this was the speed overage model with bounds of 1.0/13.89
      • This policy performed better than the queue policy according to g1, but it was actually producing gridlock (many stopped cars)
  • Goal was to find a new single objective policy that did not completely stop cars but whose objective conflicted with the queue policy
  • Found that the "avg speed limit 7" policy accomplished this goal
    • SL = 7
    • R(s) = -(avg_speed - SL) if avg_speed > SL
    • R(s) = -(SL - avg_speed) if avg_speed < SL
    • R(s) = 0 otherwise, i.e. R(s) = -|avg_speed - SL| (see the sketch after this list)
    • Performed better than queue policy according to the g1 metric (now defined as the reward used by the avg speed limit 7 policy)
    • Performed worse than the queue policy according to the g2 metric (stopped cars) but STILL ALLOWS CARS THROUGH THE SYSTEM
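
A minimal Python sketch of that reward, assuming avg_speed is the mean vehicle speed observed in the state; the constant and function names are illustrative, not the project's actual code.

```python
SPEED_LIMIT = 7.0  # SL in the notes above

def avg_speed_limit_reward(avg_speed: float, speed_limit: float = SPEED_LIMIT) -> float:
    """R(s) = -|avg_speed - SL|: zero when the average speed equals the limit,
    increasingly negative the further it drifts above (speeding) or below
    (congestion / stopped cars)."""
    if avg_speed > speed_limit:
        return -(avg_speed - speed_limit)
    if avg_speed < speed_limit:
        return -(speed_limit - avg_speed)
    return 0.0
```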

Next Steps

  • 2 potential new areas of exploration
    • Find a new way to update lambda
      • Randomly assign lambda: .5/.5, .7/.3, .8/.2, etc. (see the sketch after this list)
    • Figure out whether our off-policy evaluation is good
      • Current method is off-policy because we are using a dataset that was not generated by the policy being evaluated
      • Could be a major topic in our paper
  • Try changing the experiment so that more of the actions come from the speed model
    • E.g. 50/30/20 ratios (see the dataset-mixing sketch after this list)
    • Keep same constraint ratio & lambda learning rate
  • Also try generating a dataset with 100% of actions from the "avg speed limit 7" policy (and setting the constraint ratio to 0)
    • Might give insight into whether something is wrong with FQE or FQI
  • Finally, try reverting to the old threshold 1.0/13.89 policy
    • Even though this policy isn't useful on its own, the combined result could be
    • Can run some more experiments with this, but need to confirm that the learned policy actually produces the behavior we want (i.e. no stopped cars)
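
As a minimal sketch of the "randomly assign lambda" idea: hold a fixed lambda pair for a whole run instead of learning it. The candidate values come from the notes; the helper name is illustrative.

```python
import random

# Candidate lambda weightings from the notes above
LAMBDA_CANDIDATES = [(0.5, 0.5), (0.7, 0.3), (0.8, 0.2)]

def sample_fixed_lambdas():
    """Return one (lambda_1, lambda_2) pair to hold fixed for a training run."""
    return random.choice(LAMBDA_CANDIDATES)
```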
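
A rough sketch of the mixed-ratio dataset generation, assuming a gym-style environment and policy objects with an act(state) method; all names here are hypothetical stand-ins rather than the project's actual interfaces. A single policy with ratios=[1.0] reproduces the 100% "avg speed limit 7" case above.

```python
import random

def collect_mixed_dataset(env, policies, ratios, num_steps):
    """At each step, choose which behavior policy acts according to `ratios`
    (e.g. [0.5, 0.3, 0.2]) and log the transition for offline FQE/FQI."""
    dataset = []
    state = env.reset()
    for _ in range(num_steps):
        policy = random.choices(policies, weights=ratios, k=1)[0]
        action = policy.act(state)
        next_state, reward, done, info = env.step(action)
        dataset.append((state, action, reward, next_state, done))
        state = env.reset() if done else next_state
    return dataset
```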