Experimenting with various RL algorithms.
Reinforcement learning is founded on Bellman's optimality condition:
V(s) = max_a [ R(s,a) + γ V(s') ], where s' is the next state reached from s by taking action a.
At the basic level there are 2 techniques (a minimal tabular sketch follows this list):

- Policy iteration starts with a specific policy π and updates it with respect to its value Vπ(s). The algorithm has 2 parts: policy evaluation and policy improvement.
- Value iteration updates V(s) or Q(s,a) for all states; the policy is then derived from the value function. The second form is known as Q-learning.
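To make the Bellman backup concrete, here is a minimal tabular value-iteration sketch on a hypothetical 3-state, 2-action toy MDP. The transition table `P` and rewards `R` below are made up for illustration; they are not part of this repo:

```python
import numpy as np

# Hypothetical toy MDP (not part of this repo): 3 states, 2 actions.
# P[s, a] = deterministic next state, R[s, a] = immediate reward.
P = np.array([[1, 2], [2, 0], [2, 1]])
R = np.array([[0.0, 1.0], [0.0, 0.0], [1.0, 0.0]])
gamma = 0.9

V = np.zeros(3)
for _ in range(100):
    # Bellman backup: V(s) = max_a [ R(s,a) + gamma * V(s') ]
    V = np.max(R + gamma * V[P], axis=1)

# The greedy policy is then derived from the converged value function.
policy = np.argmax(R + gamma * V[P], axis=1)
print(V, policy)
```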
Q(s,a) is a special form of the global value function V(s): V(s) is the restriction of Q(s,a) when a takes on its optimal value. In other words:
V(s) = max_a Q(s,a).
DQN (deep Q-learning) led to the 2013 success story of playing Atari games.
However, Q-learning requires finding the max of Q(s,a) in order to recover the optimal action. If Q(s,a) is given by a neural network, such an operation is not always possible. In the discrete case, we get around this problem by outputting the Q value for each action, but we cannot do that for continuous actions.
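As an illustration of this discrete-action workaround, here is a minimal PyTorch sketch (class name and layer sizes are my own, not taken from this repo's DQN code): the network outputs one Q value per action, so max_a Q(s,a) is just a max over the output vector.

```python
import torch
import torch.nn as nn

# Illustrative only: 9-dim board input, 9 discrete actions (one per square).
class QNet(nn.Module):
    def __init__(self, state_dim=9, n_actions=9, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))   # one Q value per discrete action

    def forward(self, s):
        return self.net(s)                  # shape: (batch, n_actions)

q = QNet()
s = torch.zeros(1, 9)                       # an empty board
q_values = q(s)
V = q_values.max(dim=1).values              # V(s) = max_a Q(s,a)
a = q_values.argmax(dim=1)                  # greedy action
```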
For AGI, we need to deal with actions in high-dimensional vector-space embeddings, so continuous actions seem required. But I am experimenting with outputting the logits (i.e., an unnormalized probability distribution over actions) directly, which seems to circumvent this problem. The Transformer outputs logits too.
The policy gradient method gets around this problem because it directly differentiates the cumulative reward with respect to the policy, i.e., it calculates ∇θ J(θ), where J is the total reward and θ is the parametrization of the policy π. So it does not involve the V(s) or Q(s,a) functions.
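Here is a minimal sketch of a REINFORCE-style update, assuming a policy network that outputs logits over the 9 squares. This is generic textbook code, not the exact loss used in this repo:

```python
import torch
import torch.nn as nn

# Illustrative policy network: logits over the 9 squares.
policy = nn.Sequential(nn.Linear(9, 64), nn.ReLU(), nn.Linear(64, 9))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, returns):
    """states: (T, 9) float, actions: (T,) long, returns: (T,) discounted returns G_t."""
    dist = torch.distributions.Categorical(logits=policy(states))
    # grad J(theta) is estimated as E[ grad log pi(a|s) * G_t ],
    # so we minimize the negative return-weighted log-likelihood.
    loss = -(dist.log_prob(actions) * returns).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Actions that led to higher cumulative reward get larger weights in the loss, so the policy shifts probability mass toward them.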
The policy gradient method led to the later development of Actor-Critic algorithms, and also DDPG (deep deterministic policy gradient) and PPO.
In this demo, I try to demonstrate that symmetric NN can be applied to RL to achieve superior results in tasks that involve logical reasoning.
The "plain" state vector is just a simple array of size 3 × 3 = 9, with each array element taking on values {1,0,-1} for players 1 and -1, and empty = 0.
The "logical" state vector uses a sequence of moves to represent the state. This is because I want the new state to be a set of propositions, not just one big proposition.
Each proposition = (x, y, p) is a vector of dimension 3, where (x, y) is the position on the 3 × 3 board and p represents player 1, player -1, or empty (0). All 3 numbers can vary continuously; we just map certain intervals to the discrete values. This is analogous to the way we would embed "concepts" in vector space in the future.
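To make the two representations concrete, here is a sketch of how a board could be encoded either way, together with a DeepSets-style symmetric network over the set of propositions. The helper names and layer sizes are mine, not the repo's:

```python
import torch
import torch.nn as nn

def plain_state(board):
    """board: 3x3 nested list with entries in {1, 0, -1} -> flat 9-vector."""
    return torch.tensor([p for row in board for p in row], dtype=torch.float)

def logical_state(moves):
    """moves: list of (x, y, player) propositions -> (N, 3) tensor, i.e. a set."""
    return torch.tensor(moves, dtype=torch.float)

# DeepSets-style symmetric NN: phi is applied to every proposition,
# the results are summed (permutation-invariant), then rho maps to 9 logits.
class SymmetricPolicy(nn.Module):
    def __init__(self, hidden=64, n_actions=9):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(3, hidden), nn.ReLU())
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, props):                          # props: (N, 3)
        return self.rho(self.phi(props).sum(dim=0))    # order of moves does not matter

net = SymmetricPolicy()
logits = net(logical_state([(0, 0, 1), (1, 1, -1)]))   # two moves played so far
```

Because the propositions are summed before the final layers, permuting the move sequence leaves the output unchanged, which is exactly the symmetry the "logical" representation is meant to exploit.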
(NOTE: I am currently transitioning the code to PyTorch)
Requires:
- TensorFlow 2.0 or PyTorch 1.12.1
- Python 3.8
For example, on my Ubuntu computer I'd activate the virtual environment:
source ~/venv/bin/activate
Run this to install Gym TicTacToe:
pip3 install gym==0.19.0
cd gym-tictactoe
python setup.py install
To check Gym version, in Python:
>>> from gym.version import VERSION
>>> print(VERSION)
To run the experiments:
python run_TicTacToe.py
This will show a menu of choices:
- Python Q-table
- PyTorch DQN
- PyTorch PG symmetric NN
- PyTorch PG fully-connected NN
- TensorFlow PG symmetric NN
- TensorFlow PG fully-connected NN
- PyTorch PG Transformer
- PyTorch SAC fully-connected NN
- PyTorch DQN Transformer
Some options may be broken as I work on newer versions. Ask me directly if you encounter problems.
This plot compares the performance of the "fully-connected" (blue) vs the "symmetric" (red) NN:
Convergence can be observed early on (1000-2000), but afterwards performance remains unstable though above average. This behavior is observed in both the "plain" and "symmetric" versions, indicating that it might be a problem with the policy gradient approach as applied to this game.
To plot graphs like the above: (This can be run during training!)
python plot.py
The program will list a choice of all data files in the directory.
This may require installing:
npm install ws
pip install websockets
Run the Websocket server:
node ws.mjs &
Open the GUI.html file in your browser.
In the code run_TicTacToe.py you have to set RENDER=1 or 2. Level 2 renders every move; level 1 renders only the ending position. But you can also set RENDER=_ by pressing Ctrl-C during run-time. Remember to refresh the browser to connect the websocket.
[1] The policy gradient demo, which originally solves the Cart Pole problem, is borrowed from Morvan Zhou (周沫凡/莫烦): https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow
[2] The Tic-Tac-Toe AI Gym code is borrowed from Clément Romac: https://clementromac.github.io/projects/gym-tictactoe/
[3] The DeepSets code is borrowed from the paper's original authors: https://github.com/manzilzaheer/DeepSets