TicTacToe

Exploring different machine learning solutions to tic tac toe

References:

  1. https://nestedsoftware.com/2019/06/15/tic-tac-toe-with-the-minimax-algorithm-5988.123625.html
  2. https://becominghuman.ai/minimax-or-maximin-8772fbd6d0c2

Algorithms

  1. Minimax
  2. Tabular Q-Learning
  3. Monte Carlo Tree Search
  4. Q Neural Network
  5. Minimax with alpha-beta pruning

All algorithms have the option to use a depth quotient in their reward function. This quotient is equivalent to the number of turns taken, so the algorithm is incentivised to end the game sooner rather than later.
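As an illustration, a minimal sketch of a depth-adjusted reward; the reward constants and function name are hypothetical and not taken from the repository:

```python
# Sketch of a depth-adjusted reward, assuming a win is worth +10,
# a loss -10 and a draw 0. Subtracting the number of turns taken
# makes earlier wins (and later losses) worth more.
WIN_REWARD = 10
LOSS_REWARD = -10
DRAW_REWARD = 0

def depth_adjusted_reward(result, turns_taken):
    """Return the base reward adjusted by the depth quotient (turns taken)."""
    if result == "win":
        return WIN_REWARD - turns_taken
    if result == "loss":
        return LOSS_REWARD + turns_taken
    return DRAW_REWARD
```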

Minimax

Minimax is a decision rule for minimizing the possible loss in a worst-case (maximum loss) scenario. It works by alternately maximizing and minimizing a game score, i.e. switching perspective between the self and the opponent while walking down the game tree.

Although it plays perfectly in zero-sum games, it is computationally expensive because it traverses every node of the decision tree rooted at the current board.
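A minimal minimax sketch; the board API (`game_result`, `legal_moves`, `play`) is an illustrative assumption and does not necessarily match the repository's classes:

```python
import math

def minimax(board, is_maximizing):
    # +1 for an X win, -1 for an O win, 0 for a draw, None while the game is ongoing.
    result = board.game_result()
    if result is not None:
        return result

    best = -math.inf if is_maximizing else math.inf
    for move in board.legal_moves():
        score = minimax(board.play(move), not is_maximizing)
        best = max(best, score) if is_maximizing else min(best, score)
    return best
```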

Q-Learning

Q-learning was implemented here with epsilon-greedy action selection, with epsilon tending towards 0 as t tends towards infinity. This promotes exploration early in training and exploitation later on.
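A sketch of epsilon-greedy selection with a decaying epsilon; the decay schedule and dict-based Q-table are assumptions for illustration only:

```python
import random

def epsilon(t, start=1.0, decay=0.001):
    """Epsilon tends towards 0 as t tends towards infinity."""
    return start / (1.0 + decay * t)

def choose_action(q_table, state, legal_moves, t):
    if random.random() < epsilon(t):
        return random.choice(legal_moves)  # explore
    # Exploit: pick the move with the highest stored Q-value.
    return max(legal_moves, key=lambda m: q_table.get((state, m), 0.0))
```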

Q-Learning seems to perform better with Cache1 (which looks up all symmetrical states) than with Cache2.

To curb the over-optimism (a potential positive feedback loop) of Q-Learning's greedy max policy, two Q-tables can be used. This approach is known as Double Q-Learning: one table is updated using the best action selected by the other table. The table to update is chosen at random to prevent coupling.
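A sketch of the Double Q-Learning update described above; the hyperparameters and dict-based tables are illustrative assumptions:

```python
import random

ALPHA, GAMMA = 0.1, 0.9

def double_q_update(q_a, q_b, state, action, reward, next_state, next_moves):
    # Randomly pick which table to update to prevent coupling.
    update, evaluate = (q_a, q_b) if random.random() < 0.5 else (q_b, q_a)

    if next_moves:
        # Select the best next action with the table being updated...
        best_next = max(next_moves, key=lambda m: update.get((next_state, m), 0.0))
        # ...but evaluate that action's value with the other table.
        target = reward + GAMMA * evaluate.get((next_state, best_next), 0.0)
    else:
        target = reward

    old = update.get((state, action), 0.0)
    update[(state, action)] = old + ALPHA * (target - old)
```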

Double Q-Learning seems to require more training to achieve parity with Q-Learning but makes more accurate judgements beyond that.

So far, Q-Learning performance has only been evaluated offline; evaluating online performance is planned.

Future work:

  1. Exploring online performance
  2. Turn independent training

Monte Carlo Tree Search

Monte Carlo Tree Search is a heuristic search over the decision tree. It is preferred in more complex problems where computing all possible paths is infeasible.

UCB (Upper Confidence Bound) was used to balance exploitation and exploration; applied to tree search this is known as UCT.

As t tends towards infinity, its behaviour approaches that of minimax.
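A sketch of UCB1 child selection as used in UCT; the node attributes and the exploration constant are illustrative assumptions:

```python
import math

EXPLORATION = math.sqrt(2)

def ucb1(child, parent_visits):
    if child.visits == 0:
        return math.inf  # always try unvisited children first
    exploit = child.total_reward / child.visits
    explore = EXPLORATION * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore

def select_child(node):
    # Pick the child that maximises the exploitation + exploration score.
    return max(node.children, key=lambda c: ucb1(c, node.visits))
```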

Future work:

  1. Exploring online performance

Q Neural Network

The design of the network used in this implementation is heavily based on DeepMind's work on Deep Q Networks (DQNs): https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf.

Neural networks applied to Q-Learning allow for a more flexible representation of the state-space.

The two features of DQNs that allow neural networks to be applied in Reinforcement Learning (RL) are:

  1. Experience Replay
  2. The use of a target and online network

Experience replay was not applied here to simplify the implementation but the use of separate online and target networks was.

An additional augmentation, mirroring Double Q-Learning, is known as Double DQN: the max operation in the target is decomposed into action selection and action evaluation. The online network selects the action and the target network evaluates that action's value. This alteration was also included.
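A sketch of the Double DQN target computation, assuming PyTorch networks that map a batch of states to a vector of Q-values per action; the function and variable names are illustrative, not the repository's:

```python
import torch

def double_dqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        # Online network selects the best next action...
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # ...while the target network evaluates that action's value.
        next_values = target_net(next_states).gather(1, next_actions).squeeze(1)
        # Terminal transitions (dones == 1) get only the immediate reward.
        return rewards + gamma * (1.0 - dones) * next_values
```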

Future work:

  1. Exploring online performance
  2. Turn independent training
  3. Varying target network update frequency

Alpha-Beta Pruning

Alpha-beta pruning is a performance enhancement over minimax. It limits the tree search by not exploring sub-trees that cannot change the search result.

It does so by maintaining two values, alpha and beta: alpha stores the best score guaranteed so far for the maximizing player, and beta stores the best score guaranteed so far for the minimizing player. Alpha is initialised to -inf and beta to +inf, the worst possible values for each player.

Whenever alpha becomes greater than or equal to beta, further exploration of the current subtree cannot affect the final choice, so it is pruned. The same argument applies symmetrically for the minimizing player.
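A minimal alpha-beta sketch built on the same hypothetical board API as the minimax example above (`game_result`, `legal_moves`, `play`):

```python
import math

def alphabeta(board, is_maximizing, alpha=-math.inf, beta=math.inf):
    result = board.game_result()
    if result is not None:
        return result

    if is_maximizing:
        best = -math.inf
        for move in board.legal_moves():
            best = max(best, alphabeta(board.play(move), False, alpha, beta))
            alpha = max(alpha, best)
            if alpha >= beta:
                break  # the minimizer will never allow this line; prune
        return best
    else:
        best = math.inf
        for move in board.legal_moves():
            best = min(best, alphabeta(board.play(move), True, alpha, beta))
            beta = min(beta, best)
            if alpha >= beta:
                break  # the maximizer will never allow this line; prune
        return best
```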

A comparison of total move time per game with caching:

Minimax as X and ABPruning as O

AB time taken: 0.08975815773010254 s
Minimax time taken: 0.19148850440979004 s

ABPruning as X and Minimax as O

Minimax time taken: 0.05585050582885742 s
AB time taken: 0.11322879791259766 s
