04‐12‐2024 Weekly Tag Up
- Chi Hui
- Joe
- Joe spent some time re-analyzing the single objective policies
- In most cases, the speed overage policy performed worse than the queue policy when evaluated against the g1 metric
- Meaning if we're trying to learn to obey the g1 constraint, the speed overage policy actually wasn't helping us
- The exception to this was the speed overage model with bounds of 1.0/13.89
- This policy performed better than the queue policy according to g1, but it was actually producing gridlock (many stopped cars)
- Goal was to find a new single-objective policy whose objective conflicted with the queue policy but that did not completely stop cars
- Found that the "avg speed limit 7" policy accomplished this goal (reward defined below; a Python sketch follows this list)
SL = 7
R(s) = -(avg_speed - SL)  if avg_speed > SL
R(s) = -(SL - avg_speed)  if avg_speed < SL
R(s) = 0                  otherwise (avg_speed == SL)
- Performed better than the queue policy according to the g1 metric (now defined as the reward used by the "avg speed limit 7" policy)
- Performed worse than the queue policy according to the g2 metric (stopped cars) but STILL ALLOWS CARS THROUGH THE SYSTEM
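A minimal Python sketch of this reward, assuming `avg_speed` is the observed mean vehicle speed in the state; the function and variable names are illustrative, not the actual implementation:

```python
SL = 7.0  # target speed limit used by the "avg speed limit 7" policy

def avg_speed_limit_reward(avg_speed: float, speed_limit: float = SL) -> float:
    """Penalize deviation of the average speed from the target limit.

    Both branches reduce to -|avg_speed - speed_limit|, so the reward is
    maximal (0) only when the average speed exactly matches the target.
    """
    if avg_speed > speed_limit:
        return -(avg_speed - speed_limit)   # too fast: negative penalty
    if avg_speed < speed_limit:
        return -(speed_limit - avg_speed)   # too slow: negative penalty
    return 0.0                              # exactly at the target
```

Because stopped traffic (avg_speed near 0) is penalized just as much as speeding, this objective conflicts with the queue objective without rewarding gridlock, which matches the stated goal.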
- 2 potential new areas of exploration
- Find new way to update lambda
- Randomly assign lambda pairs, e.g., 0.5/0.5, 0.7/0.3, 0.8/0.2, etc. (see the sketch after this list)
- Figure out whether our off-policy evaluation is good
- Current method is off-policy because we are using a dataset that was not generated by the policy itself
- Could be a major topic in our paper
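A rough sketch of the "randomly assign lambda" idea, assuming the two objectives are scalarized as a weighted sum lambda_g1 * r_g1 + lambda_g2 * r_g2; the weight pairs come from the bullet above, and the function names and signatures are hypothetical:

```python
import random

# Candidate (lambda_g1, lambda_g2) weight pairs from the discussion above.
LAMBDA_PAIRS = [(0.5, 0.5), (0.7, 0.3), (0.8, 0.2)]

def sample_lambdas() -> tuple[float, float]:
    """Pick a weight pair at random instead of learning lambda."""
    return random.choice(LAMBDA_PAIRS)

def combined_reward(r_g1: float, r_g2: float, lambdas: tuple[float, float]) -> float:
    """Scalarize the two per-objective rewards with the sampled weights."""
    lam_g1, lam_g2 = lambdas
    return lam_g1 * r_g1 + lam_g2 * r_g2
```

One option would be to fix a sampled pair per run and compare the resulting policies, rather than updating lambda during training.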
- Find new way to update lambda
- Try changing the experiment so that more actions come from the speed model
- E.g., 50/30/20 action ratios (a dataset-mixing sketch appears at the end of these notes)
- Keep same constraint ratio & lambda learning rate
- Also try generating a dataset with 100% of actions from the "avg speed limit 7" policy (and setting the constraint ratio to 0)
- Might give insight into whether we have something wrong somewhere with FQE or FQI
- Finally, try reverting to the old threshold 1.0/13.89 policy
- Even though this policy isn't useful on its own, the combined result could be useful
- Can run some more experiments with this, but we need to confirm that the learned policy actually produces the behavior we want (i.e., no stopped cars)
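A sketch of the dataset-mixing experiment mentioned above (50/30/20 ratios, or 100% from one policy), assuming each policy is a callable `state -> action` and the environment exposes a simple `reset`/`step` interface; all names and the step signature are assumptions for illustration:

```python
import random

def generate_mixed_dataset(env, policies, ratios, num_steps=10_000, seed=0):
    """Roll out the environment, choosing at each step which policy acts.

    `policies` might be [queue_policy, avg_speed_7_policy, random_policy] and
    `ratios` the fraction of steps each controls, e.g. [0.5, 0.3, 0.2].
    Setting ratios to [0.0, 1.0, 0.0] reproduces the "100% actions from the
    avg speed limit 7 policy" experiment.
    """
    rng = random.Random(seed)
    dataset = []
    state = env.reset()
    for _ in range(num_steps):
        policy = rng.choices(policies, weights=ratios, k=1)[0]
        action = policy(state)
        next_state, reward, done = env.step(action)
        dataset.append((state, action, reward, next_state, done))
        state = env.reset() if done else next_state
    return dataset
```

The resulting transitions could then be fed to FQI/FQE as usual; if the 100%-speed-policy dataset still produces unexpected evaluations, that would point toward an issue in FQE/FQI rather than in the data mix.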