
01‐30‐2024 Weekly Tag Up


Attendees

  • Joe
  • Chi-Hui

Agenda

  • Discuss latest experiment results
  • Changes to lambda update?
  • Empirical way to differentiate between objectives?

Updates

  • The mean policy is not really affected by the performance of the total Cg1 constraint
    • Regardless of the constraint ratio, the G1 returns are always around 2500 for the mean policy
  • Probably need to use mean policy for generating a new dataset
    • Use mean policy and excessive speed policy?
    • Would be more likely to see G1 returns below the threshold
      • Better G1 returns and worse G2 returns
      • Could use all three policies in a 30/30/30/10 split (the last 10% being random actions); see the mixing sketch after this list
  • A new dataset is generated AFTER all rounds, then another batch of rounds is executed
  • Reminder: our goal is to minimize the G2 cost while staying below a certain G1 constraint
    • Another idea would be to increase the percentage of actions from the excess speed policy in the initial dataset
      • 10% increase
      • How do we connect this to the results?
        • Connected to ratio? - probably not, they are doing different things
        • Perhaps if 30% of actions come from the excess speed policy, we should expect the mean policy to perform at approximately 30% of that policy's level?
      • We should probably address this AFTER we address the reward for the excess speed policy
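
A rough sketch of the 30/30/30/10 mix described above, assuming three trained policies exposing an act(state) method and a discrete action list; all names here are illustrative placeholders, not the project's actual code:

```python
import random

# Hypothetical 30/30/30/10 mix: queue-length, excess-speed, and mean policies,
# plus 10% uniformly random actions. Names and weights are placeholders.
POLICY_MIX = {"queue_length": 0.3, "excess_speed": 0.3, "mean": 0.3, "random": 0.1}

def sample_action(policies, state, action_space):
    """Pick a data source according to POLICY_MIX, then take an action from it."""
    names, weights = zip(*POLICY_MIX.items())
    source = random.choices(names, weights=weights, k=1)[0]
    if source == "random":
        return random.choice(action_space)
    return policies[source].act(state)
```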

Changing Excess Speed Reward

  • Currently linear (r = -penalty if penalty > 0, otherwise r = 0)
  • Could use r = -log(penalty) or r = -penalty^(0.5)
    • Start with r = -penalty^(0.5) (see the reward sketch after this list)
      • Avoids large returns when the penalty is very small
    • Physical meaning: the faster you go, the less each additional unit of speed matters; we care more about whether you are speeding at all
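
A minimal sketch of the reward shapes discussed above, assuming "penalty" is the scalar speeding quantity the current linear reward uses; function names are illustrative only:

```python
import math

def linear_reward(penalty):
    # Current version: linear in the penalty.
    return -penalty if penalty > 0 else 0.0

def sqrt_reward(penalty):
    # Proposed starting point: r = -penalty^(0.5); stays small when the penalty is small.
    return -math.sqrt(penalty) if penalty > 0 else 0.0

def log_reward(penalty):
    # Alternative: r = -log(penalty); magnitude blows up for very small penalties.
    return -math.log(penalty) if penalty > 0 else 0.0
```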

Next Steps

  • Solve single objective first
    • In the results, show learning curves for both g1 AND g2 (the policy is trained on g1, but we want to show the inverse relationship between g1 and g2)
  • Then run the same batch offline policy learning using a ratio of -0.25
  • Assess next steps after reviewing results

Empirically relating performance metrics

  • Learning curves for single-objective policies should show returns for both metrics (see the plotting sketch after this list)
    • I.e., the queue length policy should show both g1 and g2 returns
    • They should "cross" on the learning curve (as g2 improves, g1 gets worse)
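
A minimal matplotlib sketch of the requested plot, logging both returns per training round so the crossing is visible; variable and file names are assumptions:

```python
import matplotlib.pyplot as plt

def plot_learning_curves(rounds, g1_returns, g2_returns, path="learning_curves.png"):
    """Plot g1 and g2 returns per training round on the same axes."""
    fig, ax = plt.subplots()
    ax.plot(rounds, g1_returns, label="g1 (queue length) return")
    ax.plot(rounds, g2_returns, label="g2 (excess speed) return")
    ax.set_xlabel("Training round")
    ax.set_ylabel("Return")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)
```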

Updates to lambda learning

  • Use g1 & g2 return rates instead of total returns for the update (see the sketch after this list)
    • Implement this after reviewing the results of the single-objective changes
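
A hedged sketch of the proposed update, assuming a Lagrangian-style multiplier on the G1 constraint; the threshold, learning rate, and names are placeholders, and the point is only that a per-step return rate replaces the total return in the update:

```python
def update_lambda(lam, g1_returns, episode_lengths, g1_threshold_rate, lr=0.01):
    # Return *rate*: average per-step g1 return over the batch, instead of the total return.
    g1_rate = sum(g1_returns) / sum(episode_lengths)
    # Raise lambda when the g1 rate exceeds the constraint threshold, lower it otherwise.
    lam += lr * (g1_rate - g1_threshold_rate)
    return max(lam, 0.0)  # keep the multiplier non-negative
```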