01‐30‐2024 Weekly Tag Up
- Joe
- Chi-Hui
- Discuss latest experiment results
- Changes to lambda update?
- Empirical way to differentiate between objectives?
- Mean policy is not really impacted by performance on the total Cg1 constraint
- Regardless of the constraint ratio, the G1 returns are always around 2500 for the mean policy
- Probably need to use mean policy for generating a new dataset
- Use mean policy and excessive speed policy?
- Would be more likely to see G1 returns below the threshold
- Better G1 returns and worse G2 returns
- Could use all three policies 30/30/30/10 (include random actions as well) - see the sketch below
- New dataset generated AFTER all rounds, then another batch of rounds is executed
- Reminder: our goal is to minimize the G2 cost while staying below a certain G1 constraint
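A minimal sketch of the mixed-dataset idea above, assuming a Gym-style environment interface and placeholder policy callables (`mean_policy`, `excess_speed_policy`, `queue_length_policy` are stand-in names, not the project's actual objects):

```python
import random

def generate_mixed_dataset(env, policies, weights, num_steps):
    """Collect transitions where each action's source policy is sampled
    according to `weights` (e.g. [0.3, 0.3, 0.3, 0.1], with None standing
    in for uniformly random actions)."""
    dataset = []
    state = env.reset()
    for _ in range(num_steps):
        policy = random.choices(policies, weights=weights, k=1)[0]
        action = env.action_space.sample() if policy is None else policy(state)
        next_state, reward, done, info = env.step(action)
        dataset.append((state, action, reward, next_state, done))
        state = env.reset() if done else next_state
    return dataset

# Example: 30% mean policy, 30% excess speed policy, 30% queue length policy,
# 10% uniformly random actions.
# data = generate_mixed_dataset(
#     env,
#     policies=[mean_policy, excess_speed_policy, queue_length_policy, None],
#     weights=[0.3, 0.3, 0.3, 0.1],
#     num_steps=50_000,
# )
```

Sampling the source policy per action keeps the 30/30/30/10 split simple; sampling per episode instead would be an alternative design choice.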
- Another idea would be to increase the percentage of actions from the excess speed policy in the initial dataset
- 10% increase
- How do we connect this to the results?
- Connected to ratio? - probably not, they are doing different things
- Perhaps if 30% of actions come from the excess speed policy, we should expect the mean policy to reach approximately 30% of that policy's performance?
- We should probably address this AFTER we address the reward for the excess speed policy
- Currently linear (`r = -pension if pension > 0, otherwise r = 0`)
- Could use `r = -log(pension)` or `r = -pension^(0.5)`
- Start with `r = -pension^(0.5)`
- Avoids large returns when pension is very small
- Physical meaning: the faster you go, the less we are concerned about the exact speed; we care more about whether you are speeding at all
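A minimal sketch of the three reward shapes discussed above, written out directly from the formulas in the notes (applying the `pension > 0` guard from the current linear rule to the other two is an assumption):

```python
import math

def reward_linear(pension):
    """Current shape: r = -pension when pension > 0, else 0."""
    return -pension if pension > 0 else 0.0

def reward_log(pension):
    """Candidate: r = -log(pension); grows toward +inf as pension -> 0,
    which is the 'large returns when pension is very small' problem."""
    return -math.log(pension) if pension > 0 else 0.0

def reward_sqrt(pension):
    """Chosen starting point: r = -pension**0.5; stays near 0 for small
    pension and penalizes large pension sub-linearly, matching the idea
    that we mostly care whether speeding occurs at all."""
    return -pension ** 0.5 if pension > 0 else 0.0
```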
- Solve single objective first
- In the results, show learning curves for both g1 AND g2 (the policy is trained on g1, but we want to show the inverse relationship between g1 and g2)
- Then run the same batch offline policy learning using a ratio of -0.25
- Assess next steps after reviewing results
- Learning curves for single objective policies should show returns for both metrics
- I.e. queue length policy should show g1 and g2 returns
- They should "cross" on the learning curve (as g2 improves, g1 gets worse)
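A minimal plotting sketch for the learning-curve item above, assuming per-round g1 and g2 returns are already logged (function and argument names are placeholders):

```python
import matplotlib.pyplot as plt

def plot_single_objective_curves(g1_returns, g2_returns):
    """Plot g1 and g2 returns for a policy trained on a single objective,
    so the inverse relationship (curves 'crossing') is visible."""
    rounds = range(len(g1_returns))
    fig, ax1 = plt.subplots()
    ax1.plot(rounds, g1_returns, color="tab:blue")
    ax1.set_xlabel("training round")
    ax1.set_ylabel("g1 return", color="tab:blue")
    # Second y-axis, since g1 and g2 returns are on different scales.
    ax2 = ax1.twinx()
    ax2.plot(rounds, g2_returns, color="tab:red")
    ax2.set_ylabel("g2 return", color="tab:red")
    fig.tight_layout()
    plt.show()
```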
- Use g1 & g2 return rates instead of total for the update
- Implement this after reviewing results of the changes to single objective
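A minimal sketch of the "return rates instead of totals" idea, assuming a standard non-negative dual (lambda) update against the G1 constraint; the threshold, step size, and exact update form are assumptions, not the project's current rule:

```python
def update_lambda(lam, g1_episode_returns, episode_lengths,
                  g1_threshold_rate, step_size=0.01):
    """Dual-ascent style update that uses the average per-step G1 return
    (a rate) rather than total episode returns, so episodes of different
    lengths contribute comparably."""
    rates = [ret / max(length, 1)
             for ret, length in zip(g1_episode_returns, episode_lengths)]
    avg_rate = sum(rates) / len(rates)
    # Increase lambda when the G1 rate exceeds its threshold (constraint
    # violated), decrease otherwise, and project back to non-negative.
    return max(0.0, lam + step_size * (avg_rate - g1_threshold_rate))
```

The g2 returns could be rate-normalized the same way before they enter the combined update.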