
05‐13‐2024 Weekly Tag Up


Attendees

  • Joe
  • Chi Hui

Updates

  • Interesting observation found while investigating OfflineRollout (in this case, using pi_queue to evaluate the 'true' G2 value function)
    • The policy we are learning does not always produce the optimal (argmax) actions; see the check sketched after this list
    • Example output:
    # Given Q(s,a) values for a batch are...
    q_vals = tensor([[-53.6890, -53.8984, -53.9681, -54.0499],
                     [-56.3677, -64.9737, -51.6730, -65.2594],
                     [-51.3202, -57.0683, -51.0640, -57.0104],
                     [-50.8664, -51.4212, -51.0820, -51.5279],
                     [-52.7676, -53.4041, -52.9656, -53.5141],
                     [-51.9948, -61.2472, -53.2955, -60.7665],
                     [-50.6403, -51.1513, -50.8912, -51.2644],
                     [-49.0660, -49.5205, -49.4524, -49.6094],
                     [-51.8456, -52.3394, -52.1655, -52.4277],
                     [-51.1850, -51.5609, -51.5902, -51.6839],
                     [-48.9024, -49.4357, -49.3467, -49.5483],
                     [-53.0473, -61.0626, -53.6673, -60.7544],
                     [-52.3027, -52.8130, -52.6198, -52.9309],
                     [-50.5199, -60.4218, -52.0259, -59.8795],
                     [-52.3616, -52.9330, -52.5870, -53.0479],
                     [-50.5357, -50.9945, -50.7138, -51.0971]])
    
    # The policy produces the following set of actions...
    actions = tensor([0, 2, 2, 0, 0, 0, 0, 0, 3, 3, 0, 0, 0, 0, 0, 0])
    
    # However, taking the argmax of Q(s,a) reveals that the actions should be...
    argmax_q_vals = tensor([0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    
    • Does this indicate that we should train the single-objective policies for longer?
    • Unclear whether this has any impact on the offline/online rollout comparison:
      • Online rollouts use the reward summed over the entire episode (this is provided by the env); see the return comparison sketched below
        • This sum is not discounted (UNCLEAR IF IT SHOULD BE)
      • Offline rollouts use the Q function to estimate the return
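
The mismatch can be checked directly from the logged tensors. A minimal sketch, assuming q_vals ([batch, n_actions]) and actions ([batch]) are the arrays printed above:

    import torch

    def greedy_mismatch(q_vals: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        """Return a boolean mask of batch entries where the policy's chosen action
        disagrees with the greedy argmax of Q(s,a)."""
        return actions != q_vals.argmax(dim=1)

    # For the batch printed above, only rows 8 and 9 (0-indexed) are flagged:
    # the policy picked action 3 where argmax Q(s,a) is action 0.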

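For reference on the discounting question above, a small sketch of the two return definitions being compared (function names are illustrative, not our actual code): the online rollout reports an undiscounted sum of env rewards, while Q(s,a) is trained to estimate a discounted return.

    import torch

    def undiscounted_return(rewards: torch.Tensor) -> torch.Tensor:
        # What the online rollout reports: a plain sum of the env rewards.
        return rewards.sum()

    def discounted_return(rewards: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
        # What Q(s,a) is trained to estimate: a gamma-discounted sum.
        discounts = gamma ** torch.arange(len(rewards), dtype=rewards.dtype)
        return (discounts * rewards).sum()

Unless gamma is close to 1 (or the online return is discounted the same way), these two numbers will differ even for a perfect critic, so they are not directly comparable as-is.
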
Next Steps

  • Update the Q network only for the action that was used to select the target value (a minimal critic-update sketch follows this list)
    • Typically, updates are performed over all of the network's action outputs, but this causes the Q-values for each action to be very similar
    • Updating only the selected action would make the optimal actions more obvious
    • This change to the update would apply only to the critic
  • Re-train the single-objective policies
  • Make sure the true value functions are using the DISCOUNT FACTOR
  • Re-run the ablation study using off-policy true value functions (and homogeneous datasets again)
  • Add an on-policy version of the "true" value function and then re-run the ablation study
  • At this point we are trying to figure out why our offline rollouts are so different from our online rollouts
    • They will probably never be identical, but they should be closer!
    • We may need to investigate different RL algorithms (ways to replace FQE or OfflineRollout)
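
As a reference for the first item above, a minimal sketch of a critic update that regresses only the Q-value of the action actually taken; the network, batch, and variable names here are illustrative placeholders, not our actual code:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Illustrative setup: a tiny linear critic over 4 actions and a random batch.
    n_obs, n_actions, batch, gamma = 8, 4, 16, 0.99
    q_net = nn.Linear(n_obs, n_actions)
    target_net = nn.Linear(n_obs, n_actions)
    target_net.load_state_dict(q_net.state_dict())
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

    states = torch.randn(batch, n_obs)
    actions = torch.randint(0, n_actions, (batch,))
    rewards = torch.randn(batch)
    next_states = torch.randn(batch, n_obs)
    dones = torch.zeros(batch)

    # Gather Q(s, a_taken) so the loss only touches the selected action's output,
    # instead of pushing every action head toward the same target value.
    q_taken = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        td_target = rewards + gamma * (1.0 - dones) * next_q

    loss = F.mse_loss(q_taken, td_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

If the current critic loss touches all action outputs, restricting it to the taken action as above should make the per-action gaps in printouts like the one in the Updates section more pronounced.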