05‐13‐2024 Weekly Tag Up
- Joe
- Chi Hui
- Interesting observation found during investigation of OfflineRollout (in this case, using pi_queue to evaluate the 'true' G2 value function)
  - The policy that we are learning does not always produce the optimal actions
- Example output:
```
// Given Q(s,a) values for a batch are...
q_vals = tensor([[-53.6890, -53.8984, -53.9681, -54.0499],
                 [-56.3677, -64.9737, -51.6730, -65.2594],
                 [-51.3202, -57.0683, -51.0640, -57.0104],
                 [-50.8664, -51.4212, -51.0820, -51.5279],
                 [-52.7676, -53.4041, -52.9656, -53.5141],
                 [-51.9948, -61.2472, -53.2955, -60.7665],
                 [-50.6403, -51.1513, -50.8912, -51.2644],
                 [-49.0660, -49.5205, -49.4524, -49.6094],
                 [-51.8456, -52.3394, -52.1655, -52.4277],
                 [-51.1850, -51.5609, -51.5902, -51.6839],
                 [-48.9024, -49.4357, -49.3467, -49.5483],
                 [-53.0473, -61.0626, -53.6673, -60.7544],
                 [-52.3027, -52.8130, -52.6198, -52.9309],
                 [-50.5199, -60.4218, -52.0259, -59.8795],
                 [-52.3616, -52.9330, -52.5870, -53.0479],
                 [-50.5357, -50.9945, -50.7138, -51.0971]])

// The policy produces the following set of actions...
actions = tensor([0, 2, 2, 0, 0, 0, 0, 0, 3, 3, 0, 0, 0, 0, 0, 0])

// However, taking the argmax of Q(s,a) reveals that the actions should be...
argmax_q_vals = tensor([0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
```
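A minimal sketch of this check, with toy stand-ins for the networks (`critic`, `policy`, `obs_batch` are illustrative names, not the project's actual code):

```python
import torch
import torch.nn as nn

# Toy stand-ins for illustration only; the real critic/policy in the repo differ.
obs_dim, num_actions, batch_size = 8, 4, 16
critic = nn.Linear(obs_dim, num_actions)   # maps obs -> Q(s, a) for each action
policy = nn.Linear(obs_dim, num_actions)   # maps obs -> action preferences/logits

obs_batch = torch.randn(batch_size, obs_dim)

with torch.no_grad():
    q_vals = critic(obs_batch)                 # [batch, num_actions]
    actions = policy(obs_batch).argmax(dim=1)  # actions the policy would take
    greedy = q_vals.argmax(dim=1)              # actions implied by argmax Q(s, a)

mismatch = actions != greedy
print(f"{mismatch.sum().item()}/{batch_size} policy actions differ from argmax Q")
```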
- Does this indicate that we should be training the single objective policies for longer?
- This may not have any impact on the offline/online rollout comparison, though
- Online rollouts use the reward function summed over the entire episode (this is provided by the env)
  - This sum is not discounted (UNCLEAR IF IT SHOULD BE)
- Offline rollouts use the Q function to estimate the return (see the sketch below)
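A sketch of the contrast between the two quantities, assuming a Gymnasium-style env and the same toy `critic`/`policy` as above (names and `GAMMA` are assumptions, not the repo's rollout code): the online number is a (by default undiscounted) sum of environment rewards, while the offline number is a Q estimate, which is inherently discounted.

```python
import torch

GAMMA = 0.99  # discount the critic is assumed to have been trained with

def online_return(env, policy, discount=None):
    """Roll out one episode in the real env and sum its rewards.
    With discount=None this matches the undiscounted sum described above."""
    obs, _ = env.reset()
    total, t, done = 0.0, 0, False
    while not done:
        with torch.no_grad():
            action = policy(torch.as_tensor(obs, dtype=torch.float32)).argmax().item()
        obs, reward, terminated, truncated, _ = env.step(action)
        total += reward if discount is None else (discount ** t) * reward
        t += 1
        done = terminated or truncated
    return total

def offline_estimate(critic, policy, start_obs):
    """Offline-rollout-style estimate: Q(s0, a0) for the policy's first action."""
    with torch.no_grad():
        s0 = torch.as_tensor(start_obs, dtype=torch.float32)
        a0 = policy(s0).argmax().item()
        return critic(s0)[a0].item()
```

Comparing `offline_estimate(...)` against `online_return(env, policy, discount=GAMMA)` rather than against the undiscounted sum would be the apples-to-apples version of the comparison discussed here.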
- Updating the Q network only for the action that was used to select the target value
- Typically, updates are performed for the entire network, but this causes the values to be very similar across actions
- Updating just for the selected action would make the optimal actions more obvious
- This change in update would just be for the critic (see the sketch below)
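A sketch of a critic loss restricted to one action per sample via `gather` (a standard DQN-style update; the batch keys, network names, and the choice of using the batch action are assumptions, not the repo's implementation):

```python
import torch
import torch.nn.functional as F

def critic_loss(critic, target_critic, batch, gamma=0.99):
    """TD loss computed only on Q(s, a) for the selected action.

    `batch` is assumed to be a dict of tensors: obs [B, obs_dim], actions [B],
    rewards [B], next_obs [B, obs_dim], dones [B] (0/1 floats).
    """
    q_all = critic(batch["obs"])  # [B, num_actions]
    # gather() keeps only the column for the selected action, so the loss (and
    # its gradient) touches one action's Q value per sample instead of all of them.
    q_taken = q_all.gather(1, batch["actions"].long().unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        next_q = target_critic(batch["next_obs"]).max(dim=1).values
        target = batch["rewards"] + gamma * (1.0 - batch["dones"]) * next_q

    return F.mse_loss(q_taken, target)
```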
- Re-train single objective policies
- Make sure true value functions are using DISCOUNT FACTOR
- Re-run ablation study using off-policy true value functions (and homogeneous datasets again)
- Add in an on-policy version of the "true" value function and then re-run the ablation study
- At this point we are trying to figure out why our offline rollouts are so different from our online rollouts
- They probably will never be identical but they should be closer!
- We may need to investigate different RL algorithms (ways to replace FQE or OfflineRollout)