Alternatives to MCTS? #245
beppeben
started this conversation in
Conceptual
Replies: 0 comments
This is probably a trivial question.
While reading about MCTS for the first time I was wondering if one could make a self-play learner that is simpler than that.
The approach that would come most naturally to me would be simply to learn a network V(s, a) that outputs the expected reward of taking action a in board state s.
During each step of self-play, one would sample an action with probability proportional to V(s, a) by evaluating the network over all possible actions (with a temperature parameter for better exploration). The loss function would then just be the mean of [V(s_t, a_t) - z_t]^2 over all sampled states s_t, chosen actions a_t, and obtained rewards z_t.
This way the network would be smaller (one output value instead of a full distribution over moves), the loss function would be simpler, and there would be no auxiliary variables (like the Q and N statistics that MCTS maintains).
I suppose I'm missing something, since otherwise someone would have done it already. Does anyone know why this doesn't work?
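For concreteness, here is a minimal sketch of the scheme described above, using a toy linear model in place of a neural network. The feature map `features` and all hyperparameters are illustrative assumptions, not part of the original proposal; one small deviation is that sampling uses a softmax over V(s, a) with a temperature, rather than raw proportionality, since predicted values can be negative.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(values, temperature=1.0):
    # Softmax with temperature; used instead of raw proportional sampling
    # because V(s, a) can be negative.
    z = (values - values.max()) / temperature
    e = np.exp(z)
    return e / e.sum()

def features(state, action, n_actions):
    # Hypothetical feature map standing in for a real network's input encoding.
    phi = np.zeros(n_actions)
    phi[action] = 1.0 + 0.1 * state
    return phi

def select_action(w, state, n_actions, temperature=1.0):
    # Evaluate V(s, a) = w . phi(s, a) for every legal action, then sample.
    values = np.array([features(state, a, n_actions) @ w
                       for a in range(n_actions)])
    return rng.choice(n_actions, p=softmax(values, temperature))

def update(w, episode, z, lr=0.1):
    # One gradient step per visited (s_t, a_t) on the squared error
    # (V(s_t, a_t) - z)^2 against the final game outcome z.
    for state, action, n_actions in episode:
        phi = features(state, action, n_actions)
        v = phi @ w
        w -= lr * (v - z) * phi
    return w
```

A self-play loop would then alternate `select_action` to generate games and `update` with the final reward z of each game, with no tree search, visit counts, or policy head involved.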