- It has been observed that DQN (from the Atari paper) overestimates the values of some actions, but it wasn't known why. This paper shows why it happens.
- It also proposes a specific adaptation (Double DQN) that reduces overestimation and leads to much better performance.
- The goal of RL is to learn a good policy. In standard Q-learning (as covered in Sutton & Barto), action values can be overestimated because the update includes a maximization step over estimated action values.
- Previously, overestimations were attributed to insufficiently flexible function approximation and to noise.
- Unknown whether overestimations affect performance in practice
- To test this, the authors run DQN on a set of deterministic Atari 2600 games. Even in such a favorable scenario, overestimation happens.
- They show that the idea behind Double Q-learning can be applied to DQN (yielding Double DQN) to obtain more accurate value estimates on Atari, and hence state-of-the-art RL results.
- The value of a state-action pair under a policy is the expected sum of discounted rewards, where gamma is the discount factor that down-weights later rewards (equations below).
- Most interesting problems are too large to learn all action values separately, so a parameterized value function Q(s, a; theta_t) is learned instead, with theta_t updated after each transition.
- A Deep Q-Network (DQN) takes a state s as input and outputs a vector of action values Q(s, ·; theta_t), one per action, given the current parameters theta_t.
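As written in the paper, the value of a policy, the standard Q-learning target, and the parameter update are:

$$Q_\pi(s, a) \equiv \mathbb{E}\left[R_1 + \gamma R_2 + \cdots \mid S_0 = s, A_0 = a, \pi\right]$$

$$Y_t^{Q} \equiv R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta_t)$$

$$\theta_{t+1} = \theta_t + \alpha \left(Y_t^{Q} - Q(S_t, A_t; \theta_t)\right) \nabla_{\theta_t} Q(S_t, A_t; \theta_t)$$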
- Two important concepts are experience replay and the use of a target network
- The target network has the same architecture as the online network, but its parameters theta^- are copied from the online network every tau steps and held fixed in between (target written below).
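With the target network, the DQN target becomes:

$$Y_t^{DQN} \equiv R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta_t^{-})$$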
- The max operator in Q-Learning and DQN uses the same values to select and evaluate an action
- This makes it more likely to select overestimated values, resulting in overoptimistic value estimates (the shared selection and evaluation are made explicit below).
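Rewriting the Q-learning target makes explicit that the same weights theta_t are used both to select the greedy action and to evaluate it:

$$Y_t^{Q} = R_{t+1} + \gamma\, Q\big(S_{t+1}, \operatorname{argmax}_a Q(S_{t+1}, a; \theta_t);\ \theta_t\big)$$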
- In Double Q-learning, two value functions are learned by randomly assigning each experience to update one of the two.
- This way we learn with two sets of weights (Double Q-learning target written out below):
- Set of weights theta: determines the greedy policy (action selection)
- Set of weights theta prime: evaluates the value of that greedy policy
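The Double Q-learning target decouples selection (theta_t) from evaluation (theta'_t):

$$Y_t^{DoubleQ} \equiv R_{t+1} + \gamma\, Q\big(S_{t+1}, \operatorname{argmax}_a Q(S_{t+1}, a; \theta_t);\ \theta'_t\big)$$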
- Thrun and Schwartz (1993) analyzed the case where the action-value estimates contain random errors uniformly distributed in the interval [-epsilon, epsilon].
- Thrun and Schwartz derived an upper bound on the resulting overestimation. More interestingly, Theorem 1 in this paper derives a lower bound on the overestimation of the max estimator, while under the same conditions the lower bound on the absolute error of the Double Q-learning estimate is zero.
- Important: we do not need to even assume that estimation errors for different actions are independent
- Overestimations indeed increase with the number of actions (a quick numerical illustration below).
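As a rough numerical sketch (my own, not from the paper), assume all true action values are equal and the estimates carry i.i.d. uniform noise in [-epsilon, epsilon], as in the Thrun and Schwartz setting. The bias of the single max estimator grows with the number of actions m, while a double-estimator style selection/evaluation split stays roughly unbiased:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 1.0           # noise magnitude
n_trials = 100_000  # Monte Carlo samples

for m in [2, 5, 10, 50]:  # number of actions
    # True action values are all 0; estimates are pure uniform noise in [-eps, eps].
    noisy_q = rng.uniform(-eps, eps, size=(n_trials, m))
    max_bias = noisy_q.max(axis=1).mean()  # bias of the single max estimator

    # Double estimator: select the argmax with one independent estimate,
    # evaluate it with another -> expectation is the true value (0).
    noisy_q2 = rng.uniform(-eps, eps, size=(n_trials, m))
    selected = noisy_q.argmax(axis=1)
    double_bias = noisy_q2[np.arange(n_trials), selected].mean()

    print(f"m={m:3d}  max-estimator bias: {max_bias:+.3f}  double-estimator bias: {double_bias:+.3f}")
```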
- The idea is to reduce overestimation by decomposing the max operation into action selection and evaluation
- The proposal is to select the greedy action according to the online network, and to use the target network to estimate its value.
- The only change relative to Double Q-learning is that the weights of the second network theta' are replaced with the weights of the target network theta^- for the evaluation of the current greedy policy (a sketch of the resulting target below).
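A minimal sketch of how the two targets differ, assuming `q_online_next` and `q_target_next` hold the online and target networks' action values for the next states (batch x actions); the names and the terminal-state masking are illustrative, not from the paper's code:

```python
import numpy as np

def dqn_target(rewards, dones, q_target_next, gamma=0.99):
    # Standard DQN: the target network both selects and evaluates the greedy action.
    return rewards + gamma * (1.0 - dones) * q_target_next.max(axis=1)

def double_dqn_target(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    # Double DQN: the online network selects the greedy action,
    # the target network evaluates it.
    greedy_actions = q_online_next.argmax(axis=1)
    batch = np.arange(len(greedy_actions))
    evaluated = q_target_next[batch, greedy_actions]
    return rewards + gamma * (1.0 - dones) * evaluated
```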
- Evaluated on the Atari 2600 testbed, where inputs are high-dimensional and game mechanics change substantially from game to game.
- Double DQN's value estimates are shown to be not only less overestimated but also much more stable over time.
- The value-estimate curves for Double DQN stay much closer to the blue line representing the true value of the learned policy.