04‐12‐2024 Weekly Tag Up
- Chi Hui
- Joe
- Joe spent some time re-analyzing the single objective policies
- In most cases, the speed overage policy performed worse than the queue policy when evaluated against the g1 metric
- Meaning if we're trying to learn to obey the g1 constraint, the speed overage policy actually wasn't helping us
- The exception to this was the speed overage model with bounds of 1.0/13.89
- This policy performed better than the queue policy according to g1, but it was actually producing gridlock (many stopped cars)
- Goal was to find a new single-objective policy whose objective conflicted with the queue policy but that did not completely stop cars
- Found that the "avg speed limit 7" policy accomplished this goal (reward defined below; a Python sketch follows this list)
SL = 7
R(s) = -(avg_speed - SL)  if avg_speed > SL
R(s) = -(SL - avg_speed)  if avg_speed < SL
R(s) = 0                  otherwise (avg_speed == SL)
- Performed better than the queue policy according to the g1 metric (now defined as the reward used by the "avg speed limit 7" policy)
- Performed worse than the queue policy according to the g2 metric (stopped cars) but STILL ALLOWS CARS THROUGH THE SYSTEM
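A minimal Python sketch of this reward, assuming `avg_speed` is the observed mean vehicle speed in the state; the function and variable names are illustrative, not the actual implementation:

```python
SL = 7.0  # target speed limit used by the "avg speed limit 7" policy

def avg_speed_limit_reward(avg_speed: float, speed_limit: float = SL) -> float:
    """Penalize deviation of the average speed from the target limit.

    Both branches reduce to -|avg_speed - speed_limit|, so the reward is
    maximal (0) only when the average speed exactly matches the target.
    """
    if avg_speed > speed_limit:
        return -(avg_speed - speed_limit)   # too fast: negative penalty
    if avg_speed < speed_limit:
        return -(speed_limit - avg_speed)   # too slow: negative penalty
    return 0.0                              # exactly at the target
```

Because stopped traffic (avg_speed near 0) is penalized just as much as speeding, this objective conflicts with the queue objective without rewarding gridlock, which matches the stated goal.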
- 2 potential new areas of exploration
- Find new way to update lambda
- Randomly assign lambda pairs, e.g., 0.5/0.5, 0.7/0.3, 0.8/0.2, etc. (see the sketch after this list)
- Figure out whether our off-policy evaluation is good
- Current method is off-policy because we are using a dataset that was not generated by the policy itself
- Could be a major topic in our paper
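A rough sketch of the "randomly assign lambda" idea, assuming the two objectives are scalarized as a weighted sum lambda_g1 * r_g1 + lambda_g2 * r_g2; the weight pairs come from the bullet above, and the function names and signatures are hypothetical:

```python
import random

# Candidate (lambda_g1, lambda_g2) weight pairs from the discussion above.
LAMBDA_PAIRS = [(0.5, 0.5), (0.7, 0.3), (0.8, 0.2)]

def sample_lambdas() -> tuple[float, float]:
    """Pick a weight pair at random instead of learning lambda."""
    return random.choice(LAMBDA_PAIRS)

def combined_reward(r_g1: float, r_g2: float, lambdas: tuple[float, float]) -> float:
    """Scalarize the two per-objective rewards with the sampled weights."""
    lam_g1, lam_g2 = lambdas
    return lam_g1 * r_g1 + lam_g2 * r_g2
```

One option would be to fix a sampled pair per run and compare the resulting policies, rather than updating lambda during training.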
- Find new way to update lambda
- Try changing the experiment so that more actions come from the speed model
- E.g., 50/30/20 action ratios (a dataset-mixing sketch appears at the end of these notes)
- Keep same constraint ratio & lambda learning rate
- Also try generating a dataset with 100% of actions from the "avg speed limit 7" policy (and setting the constraint ratio to 0)
- Might give insight into whether we have something wrong somewhere with FQE or FQI
- Finally, try reverting to the old threshold 1.0/13.89 policy
- Even though this policy isn't useful on its own, the combined result could be useful
- Can run some more experiments with this, but we need to confirm that the learned policy actually produces the behavior we want (i.e., no stopped cars)
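A sketch of the dataset-mixing experiment mentioned above (50/30/20 ratios, or 100% from one policy), assuming each policy is a callable `state -> action` and the environment exposes a simple `reset`/`step` interface; all names and the step signature are assumptions for illustration:

```python
import random

def generate_mixed_dataset(env, policies, ratios, num_steps=10_000, seed=0):
    """Roll out the environment, choosing at each step which policy acts.

    `policies` might be [queue_policy, avg_speed_7_policy, random_policy] and
    `ratios` the fraction of steps each controls, e.g. [0.5, 0.3, 0.2].
    Setting ratios to [0.0, 1.0, 0.0] reproduces the "100% actions from the
    avg speed limit 7 policy" experiment.
    """
    rng = random.Random(seed)
    dataset = []
    state = env.reset()
    for _ in range(num_steps):
        policy = rng.choices(policies, weights=ratios, k=1)[0]
        action = policy(state)
        next_state, reward, done = env.step(action)
        dataset.append((state, action, reward, next_state, done))
        state = env.reset() if done else next_state
    return dataset
```

The resulting transitions could then be fed to FQI/FQE as usual; if the 100%-speed-policy dataset still produces unexpected evaluations, that would point toward an issue in FQE/FQI rather than in the data mix.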