05‐13‐2024 Weekly Tag Up
- Joe
- Chi Hui
- Interesting observation found during investigation of OfflineRollout (in this case, using pi_queue to evaluate the 'true' G2 value function)
  - The policy that we are learning does not always produce the optimal actions
- Example output:
```
// Given Q(s,a) values for a batch are...
q_vals = tensor([[-53.6890, -53.8984, -53.9681, -54.0499],
                 [-56.3677, -64.9737, -51.6730, -65.2594],
                 [-51.3202, -57.0683, -51.0640, -57.0104],
                 [-50.8664, -51.4212, -51.0820, -51.5279],
                 [-52.7676, -53.4041, -52.9656, -53.5141],
                 [-51.9948, -61.2472, -53.2955, -60.7665],
                 [-50.6403, -51.1513, -50.8912, -51.2644],
                 [-49.0660, -49.5205, -49.4524, -49.6094],
                 [-51.8456, -52.3394, -52.1655, -52.4277],
                 [-51.1850, -51.5609, -51.5902, -51.6839],
                 [-48.9024, -49.4357, -49.3467, -49.5483],
                 [-53.0473, -61.0626, -53.6673, -60.7544],
                 [-52.3027, -52.8130, -52.6198, -52.9309],
                 [-50.5199, -60.4218, -52.0259, -59.8795],
                 [-52.3616, -52.9330, -52.5870, -53.0479],
                 [-50.5357, -50.9945, -50.7138, -51.0971]])

// The policy produces the following set of actions...
actions = tensor([0, 2, 2, 0, 0, 0, 0, 0, 3, 3, 0, 0, 0, 0, 0, 0])

// However, taking the argmax of Q(s,a) reveals that the actions should be...
argmax_q_vals = tensor([0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
```
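A minimal sketch of this check, with toy stand-ins for the networks (`critic`, `policy`, `obs_batch` are illustrative names, not the project's actual code):

```python
import torch
import torch.nn as nn

# Toy stand-ins for illustration only; the real critic/policy in the repo differ.
obs_dim, num_actions, batch_size = 8, 4, 16
critic = nn.Linear(obs_dim, num_actions)   # maps obs -> Q(s, a) for each action
policy = nn.Linear(obs_dim, num_actions)   # maps obs -> action preferences/logits

obs_batch = torch.randn(batch_size, obs_dim)

with torch.no_grad():
    q_vals = critic(obs_batch)                 # [batch, num_actions]
    actions = policy(obs_batch).argmax(dim=1)  # actions the policy would take
    greedy = q_vals.argmax(dim=1)              # actions implied by argmax Q(s, a)

mismatch = actions != greedy
print(f"{mismatch.sum().item()}/{batch_size} policy actions differ from argmax Q")
```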
- Does this indicate that we should be training the single objective policies for longer?
- This may not have any impact on the offline/online rollout comparison, though
- Online rollouts use the reward function summed over the entire episode (this is provided by the env)
  - This sum is not discounted (UNCLEAR IF IT SHOULD BE)
- Offline rollouts use the Q function to estimate the return (see the sketch below)
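A sketch of the contrast between the two quantities, assuming a Gymnasium-style env and the same toy `critic`/`policy` as above (names and `GAMMA` are assumptions, not the repo's rollout code): the online number is a (by default undiscounted) sum of environment rewards, while the offline number is a Q estimate, which is inherently discounted.

```python
import torch

GAMMA = 0.99  # discount the critic is assumed to have been trained with

def online_return(env, policy, discount=None):
    """Roll out one episode in the real env and sum its rewards.
    With discount=None this matches the undiscounted sum described above."""
    obs, _ = env.reset()
    total, t, done = 0.0, 0, False
    while not done:
        with torch.no_grad():
            action = policy(torch.as_tensor(obs, dtype=torch.float32)).argmax().item()
        obs, reward, terminated, truncated, _ = env.step(action)
        total += reward if discount is None else (discount ** t) * reward
        t += 1
        done = terminated or truncated
    return total

def offline_estimate(critic, policy, start_obs):
    """Offline-rollout-style estimate: Q(s0, a0) for the policy's first action."""
    with torch.no_grad():
        s0 = torch.as_tensor(start_obs, dtype=torch.float32)
        a0 = policy(s0).argmax().item()
        return critic(s0)[a0].item()
```

Comparing `offline_estimate(...)` against `online_return(env, policy, discount=GAMMA)` rather than against the undiscounted sum would be the apples-to-apples version of the comparison discussed here.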
- Updating the Q network only for the action that was used to select the target value
- Typically, updates are performed for the entire network, but this causes the values to be very similar across actions
- Updating just for the selected action would make the optimal actions more obvious
- This change in update would just be for the critic (see the sketch below)
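A sketch of a critic loss restricted to one action per sample via `gather` (a standard DQN-style update; the batch keys, network names, and the choice of using the batch action are assumptions, not the repo's implementation):

```python
import torch
import torch.nn.functional as F

def critic_loss(critic, target_critic, batch, gamma=0.99):
    """TD loss computed only on Q(s, a) for the selected action.

    `batch` is assumed to be a dict of tensors: obs [B, obs_dim], actions [B],
    rewards [B], next_obs [B, obs_dim], dones [B] (0/1 floats).
    """
    q_all = critic(batch["obs"])  # [B, num_actions]
    # gather() keeps only the column for the selected action, so the loss (and
    # its gradient) touches one action's Q value per sample instead of all of them.
    q_taken = q_all.gather(1, batch["actions"].long().unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        next_q = target_critic(batch["next_obs"]).max(dim=1).values
        target = batch["rewards"] + gamma * (1.0 - batch["dones"]) * next_q

    return F.mse_loss(q_taken, target)
```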
- Re-train single objective policies
- Make sure true value functions are using DISCOUNT FACTOR
- Re-run ablation study using off-policy true value functions (and homogeneous datasets again)
- Add in an on-policy version of the "true" value function and then re-run the ablation study
- At this point we are trying to figure out why our offline rollouts are so different from our online rollouts
- They probably will never be identical but they should be closer!
- We may need to investigate different RL algorithms (ways to replace FQE or OfflineRollout)