\documentclass{article}
\usepackage{amssymb} % needed for \varnothing which is empty set
\usepackage{graphicx} % Required for inserting images
% these two packages needed for argmax
\usepackage{amsmath}
\DeclareMathOperator*{\argmax}{argmax}
% packages needed for pseudocode
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{subcaption}% Needed for subcaption for side-by-side figures
\usepackage{indentfirst}% package to indent first paragraph after section header
\usepackage{float}% make tables and figures stay in correct section
\usepackage[table]{xcolor} % allow coloured cells
\usepackage{tikz} % for designing figures
\usepackage[section]{placeins} % keep figures in same section
\title{2048 AI: Monte Carlo-Powered Markov Decision Process}
\author{Noah Ripstein}
\date{}
\begin{document}
\maketitle
\begin{abstract}
A 2048 Python clone was developed and used to test a series of AI strategies. A strategy which formalizes the game as a Markov Decision Process powered by Monte Carlo simulations significantly outperformed a Pure Monte Carlo Tree Search algorithm. Model search parameters were then optimized to find the Pareto efficient solution set which maximizes performance and minimizes runtime.
\end{abstract}
\section{Introduction}
2048 is an addictive game which was originally released in 2014. The game begins with two tiles, each with a value of either 2 or 4, randomly placed on a $4\times4$ grid. The player can move the tiles in four directions: up, down, left, or right. When a direction is chosen, all tiles on the grid slide as far as possible in that direction, merging with any adjacent tiles of the same value to form a new tile with double the value. The value of the new tile is added to the score. After the player's turn, a new tile spawns in a random empty location; this new tile has a 90\% chance of being a 2, and a 10\% chance of being a 4. The game ends when the board is filled with tiles and the player has no legal moves. The goal of the game is to combine tiles until the 2048 tile is reached (although it is possible to continue playing after winning).
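The spawn rule is the only source of randomness in the game, and it drives the transition probabilities used later. The following minimal Python sketch illustrates the rule; the board representation and the function name \texttt{spawn\_tile} are illustrative choices, not taken from the clone's implementation.
\begin{verbatim}
import random

def spawn_tile(board):
    """Place a new tile on a random empty cell: a 2 with probability
    0.9, a 4 with probability 0.1. `board` is assumed to be a 4x4
    list of lists with 0 marking an empty cell."""
    empty = [(r, c) for r in range(4) for c in range(4)
             if board[r][c] == 0]
    r, c = random.choice(empty)
    board[r][c] = 2 if random.random() < 0.9 else 4
\end{verbatim}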
A Pure Monte Carlo Tree Search was first implemented, and then improved upon by formalizing part of the game's decision tree as a Markov Decision Process. The Markov Decision Process strategy achieves the 2048 tile 97\% of the time and the 4096 tile 58\% of the time.
\begin{figure}[htbp]
\centering
\begin{subfigure}[b]{0.45\textwidth}
\includegraphics[width=\textwidth]{original_ss.jpeg}
\caption{Original Game}
\label{fig:original_ss}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.45\textwidth}
\includegraphics[width=\textwidth]{ss_of_mine.png}
\caption{Python Clone}
\label{fig:ss_of_mine}
\end{subfigure}
\caption{Screenshots from original game and my implementation}
\label{fig:screenshots}
\end{figure}
%Discuss goal of game\\
%how its played briefly\\
%after the player selects a move, a new time spawns in a random empty location. It has a 90\% chance of being a 2, and a 10\% chance of being a 4.
\section{The Decision Tree}
One of the challenges of creating an AI which plays 2048 is the sheer number of possible games. Figure \ref{fig:tree1} represents the possible board states after the player makes only one move. If there are 8 free spaces on the board, for example, then there are 64 possible game states after the player's move (assuming each of left, right, up and down is a legal move and no tiles are combined). In general, there are $2(|S|)(m) + c$ possible states after the player's move, where $|S|$ is the number of legal moves, $m$ is the number of empty spaces on the board after tiles have been combined on the player's turn, and $c$ is the number of tiles which get merged as a result of the player's move.
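Plugging the eight-empty-cell example into this formula (four legal moves, no merges, and a 2 or a 4 able to spawn on any of the eight empty cells) recovers the count given above:
\begin{equation*}
2(|S|)(m) + c = 2(4)(8) + 0 = 64.
\end{equation*}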
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{tree_1.jpeg}
\caption{Decision tree representing possible board states after each move}
\label{fig:tree1}
\end{figure}
\section{AI Designs}
\subsection{Pure Monte Carlo Tree Search}
The initial algorithm employed is a Pure Monte Carlo Tree Search (Algorithm 1). This algorithm takes the current game position and the desired number of simulations per direction ($n$) as inputs. It explores all legal next moves from the current position by simulating $n$ games for each potential move. The scores from the end of these simulated games are then averaged to determine the desirability of each move. The direction with the highest average score is selected:
\begin{equation}
\label{PMCTS_move_selection}
\textrm{Selected Move} = \argmax_{move} \text{PMCTS}(move)
\end{equation}
This approach begins from the green nodes in the game tree diagram (Figure \ref{fig:tree1}). From there, the algorithm proceeds through random exploration to reach the red child nodes, which represent the spawning of a 2 or 4 tile in each possible location.
While this approach provides a comprehensive exploration of the game tree, it has significant limitations. The primary concern lies in the random nature of the search process. As a consequence, some of the simulated games performed during the Monte Carlo simulations may yield exceptionally poor results that are highly unlikely to occur in actual gameplay. This led me to want to discard a portion of those simulated games with particularly poor scores from consideration.
However, simply modifying Algorithm 1 to calculate the average score for a given move using only the top-performing simulated games would not adequately address this source of randomness. There are two sources of randomness inherent in the Pure Monte Carlo Tree Search: randomness associated with game-play (which I aim to reduce), and randomness of tile spawns. Discarding randomly played games with low scores in an attempt to address the former source of randomness might prevent the AI from evaluating branches of the tree which involve unlucky tile spawns after the next turn.
% could also use delta performance. that way poor moves would be punished more I think
\begin{algorithm}
\caption{Pure Monte Carlo Tree Search}
\label{PMCTS_algo}
\begin{algorithmic}[1]
\Function{PMCTS}{$\text{currentPosition, n}$}
\State $\text{directionScores} \gets \text{list()}$
\For{$\text{nextMove} \in \text{legalMoves}(\text{currentPosition})$}
\State $\text{nextMoveScores} \gets \text{list()}$
\For{$\text{gameNumber} \gets 1 \text{ to } n$}
\State $\text{result} \gets \text{playRandomGame}(\text{currentPosition}, \text{nextMove})$
\State $\text{nextMoveScores}.\text{append}(\text{result})$
\EndFor
\State $\text{averageScore} \gets \text{mean(nextMoveScores)}$
\State $\text{directionScores}.\text{append}(\text{averageScore})$
\EndFor
% \State $\text{bestMove} \gets \text{legalMoves}(\text{currentPosition})[\text{scores}.\text{argmax}()]$
\State \textbf{return} $\text{directionScores}$
\EndFunction
\end{algorithmic}
\end{algorithm}
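For concreteness, the selection step of Equation \ref{PMCTS_move_selection} applied to the per-direction averages computed by Algorithm \ref{PMCTS_algo} might look as follows in Python. This is a sketch: the helpers \texttt{legal\_moves} and \texttt{play\_random\_game} stand in for the game engine and are not names from the actual implementation.
\begin{verbatim}
def pmcts(current_position, n):
    """Average the final scores of n random games per legal move."""
    direction_scores = {}
    for move in legal_moves(current_position):
        results = [play_random_game(current_position, move)
                   for _ in range(n)]
        direction_scores[move] = sum(results) / len(results)
    return direction_scores

def select_move(current_position, n):
    # Equation 1: pick the move with the highest average score.
    scores = pmcts(current_position, n)
    return max(scores, key=scores.get)
\end{verbatim}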
\subsection{Markov Decision Process}
The pitfalls of the Pure Monte Carlo Tree Search raised in Section 3.1 can be circumvented by formalizing the game structure as a Markov Decision Process (MDP). The core functionality of the Monte Carlo tree search is retained as part of the MDP-based strategy, but its evaluations are made more useful by explicitly maximizing expected value. Where the PMCTS algorithm begins Monte Carlo simulations from the board state after a player's move (green nodes in Figure \ref{fig:tree1}), the MDP strategy begins Monte Carlo simulations from each possible tile spawn in response to a player's move (red nodes in Figure \ref{fig:tree1}). It is possible to model the game as a Markov decision process because the game satisfies the Markov property: the probability of reaching a future state depends only on the current state, not on previous states:
$$P(s_n \mid s_{n-1}, s_{n-2}, \ldots, s_{0}) = P(s_n \mid s_{n-1})$$
\noindent In general, an MDP is characterized as follows:
\begin{enumerate}
\item $S$: The set of states (board position and score) which could arise after the AI's next move.
\item $A$: The set of actions which the AI could legally take from the current position.
\item $P(s^\prime | s, a)$: Transition probability of reaching state $s^\prime$ given current state $s \in S$ after taking action $a \in A$.
\item $V(s)$: The value function which determines the expected future reward associated with entering a state $s \in S$
\item $\pi(s)$: The policy function which uses the other functions to strategically pick an action $a \in A$ given any board state.
\item $R(s, a)$: The immediate reward associated with taking action $a \in A$ from state $s$, which is a part of many MDP designs. Justification for why this function was excluded can be found in section 3.3.
\end{enumerate}
$A$ can be constructed by checking which of left, right, up or down are legal moves. $S$ can be constructed by placing a 2 and then a 4 on each empty tile for each $a \in A$ (Algorithm \ref{CREATE_S}). The value function $V(s, \vec{\theta})$ (Algorithm \ref{value_function}) performs a Monte Carlo tree search, with the number of simulations determined by the parameter vector $\vec{\theta}$, which contains the following parameters:
\begin{enumerate}
\item Number of random searches for 2 spawning with 1-3 empty tiles.
\item Number of random searches for 2 spawning with 4-6 empty tiles.
\item Number of random searches for 2 spawning with 7-9 empty tiles.
\item Number of random searches for 2 spawning with 10-15 empty tiles.
\item How many times more searches 2 spawns should get compared to 4 spawns.
\item Top proportion of best performing moves to include for score evaluation.
\end{enumerate}
$\theta_1, \theta_2, \theta_3$ are scaled so their respective empty-tile ranges have the same total number of simulations. $\theta_4$ uses the same number of simulations for each of 10--15 empty tiles. If $\theta_1 = 500$, for example, the Monte Carlo simulation will run 500 times when there is one empty tile, 250 times when there are two, and 125 times when there are three; if $\theta_4 = 20$, then the Monte Carlo simulation will run 20 times for any number of empty tiles $\in [10, 15]$.
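To make this scaling concrete, the sketch below maps a number of empty tiles to a simulation count; the halving scheme mirrors the 500/250/125 example above, and the function name and zero-based indexing of $\vec{\theta}$ are my own conventions rather than the implementation's.
\begin{verbatim}
def num_simulations(empty_tiles, theta):
    """Simulation budget for a 2 spawning, given the number of empty
    tiles. theta[0..3] correspond to theta_1..theta_4."""
    if empty_tiles <= 3:
        base, offset = theta[0], empty_tiles - 1
    elif empty_tiles <= 6:
        base, offset = theta[1], empty_tiles - 4
    elif empty_tiles <= 9:
        base, offset = theta[2], empty_tiles - 7
    else:
        return theta[3]            # constant for 10-15 empty tiles
    return max(1, round(base / (2 ** offset)))

# theta[0] = 500 gives 500, 250 and 125 simulations for 1, 2 and 3
# empty tiles respectively; a 4 spawn would receive this count
# divided by theta[4] (the 2-vs-4 search ratio, theta_5).
\end{verbatim}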
$\vec{\theta}$ remains constant throughout a single play-through of the game. As will be discussed in Section 6.1, optimizing $\vec{\theta}$ became a primary direction of inquiry. The policy function, which determines what action $a \in A$ to take, $\pi(s)$, is given by Equation \ref{MC12_policy_fn}:
\begin{equation}
\label{MC12_policy_fn}
\pi(s) = \argmax\limits_{a}\left( \sum\limits_{s' \in S} P(s'|s, a)V(s', \vec{\theta}) \right)
\end{equation}
Notice that $\pi(s)$ picks the action with the highest expected value, where the value associated with reaching a state is given by a Monte Carlo tree search. Compared to the Pure Monte Carlo Tree Search, the MDP facilitates more sophisticated inference in two ways.
%This is appropriate for $\theta_4$ because beyond early stages of the game, it is uncommon for there to be 10 or more empty tiles, which makes scaling this parameter unnecessarily computationally expensive.
\begin{enumerate}
\item By discarding a portion of explored games with poor results at each node, the impact of the Monte Carlo search playing out extremely poor moves due to chance can be mitigated. Crucially, simulations with lower scores due to ``unlucky'' tile spawns in the subsequent turn are not eliminated, ensuring a more comprehensive exploration of potential game outcomes.
\item Unlike the Pure Monte Carlo Tree Search, where game states in which a 2 spawns after the player's turn are explored nine times as frequently as those in which a 4 spawns, the number of Monte Carlo searches on each node type can be set independently. This can guarantee that all nodes are explored at least 3 times, which gives far more information than a node being explored only once, without significantly increasing runtime.
\end{enumerate}
The MDP is implemented in a manner which makes the number of Monte Carlo searches per node ($\theta_1$--$\theta_5$) and the proportion of top-performing simulated scores to keep ($\theta_6$) customizable. Each of these parameters impacts the reported score associated with a direction, and therefore the move selected by $\pi(s)$.
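A Python sketch of the policy in Equation \ref{MC12_policy_fn} follows. From the spawn rule described in the Introduction, a particular successor state $s'$ has probability $0.9/m$ (a 2 spawning on a given empty cell) or $0.1/m$ (a 4 spawning there), where $m$ is the number of empty cells after the move. The helpers \texttt{legal\_moves}, \texttt{apply\_move}, \texttt{empty\_cells}, \texttt{place\_tile} and \texttt{value} (the latter standing in for Algorithm \ref{value_function}) are assumed names, not the implementation's API.
\begin{verbatim}
def policy(board, theta):
    """Pick the action with the highest expected Monte Carlo value."""
    best_action, best_ev = None, float("-inf")
    for action in legal_moves(board):
        after, gained = apply_move(board, action)  # board and score gain
        empties = empty_cells(after)
        m = len(empties)
        expected_value = 0.0
        for cell in empties:
            for tile, prob in ((2, 0.9), (4, 0.1)):
                successor = (place_tile(after, cell, tile), gained)
                expected_value += (prob / m) * value(successor, theta)
        if expected_value > best_ev:
            best_action, best_ev = action, expected_value
    return best_action
\end{verbatim}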
\begin{algorithm}
\caption{Generation of possible next states $S$}
\label{CREATE_S}
\begin{algorithmic}[1]
\Function{getStates}{A}
\State $S = \varnothing$
\For{$a \in A$}
\State $c_s \gets$ score associated with taking action $a$
\For{tileNum $\in \{2, 4 \}$}
\For{tile in emptyTiles}
\State place tileNum on tile
\State Add $s \gets$ (board, $c_s$) to $S$
\State remove tileNum from tile
\EndFor
\EndFor
\EndFor
\State \textbf{Return }$S$
\EndFunction
\end{algorithmic}
\end{algorithm}
\begin{algorithm}
\caption{Value Function: $V(s, \vec{\theta})$}
\label{value_function}
\begin{algorithmic}[1]
\Function{$V(s, \vec{\theta})$}{}
\State $resultList \gets$ empty list
\For{$i \gets 1$ to $\theta_{\text{num\_sims}}$}\Comment{$\theta_{\text{num\_sims}}$ is a model parameter}
\State $result \gets$ \Call{RandomGame}{$s$} \Comment{Call the RandomGame function}
\State add $result$ to $resultList$ \Comment{Record the result}
\EndFor
\State sort $resultList$ in descending order
\State $proportion \gets \theta_{\text{proportion}} \times \text{length}(resultList)$
\State $topResults \gets resultList[1:\text{round}(proportion)]$ \Comment{Keep only the top-scoring results}
\State \textbf{return} $\text{mean}(topResults)$
\EndFunction
\Statex
\Function{RandomGame}{$s$}
\While{game not over}
\State Pick random $a \in A$
\State Make move $a$
\EndWhile
\State \textbf{return} game score
\EndFunction
\end{algorithmic}
\end{algorithm}
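A Python sketch of Algorithm \ref{value_function}: random rollouts are played from the spawn state, only the top-performing proportion $\theta_6$ of results is kept, and their mean is returned as the scalar value used in Equation \ref{MC12_policy_fn}. The helper names (and the reuse of \texttt{num\_simulations} and \texttt{empty\_cells} from the earlier sketches) are assumptions, not the implementation's API.
\begin{verbatim}
import random

def value(state, theta):
    """Monte Carlo value estimate of a spawn state (board, score)."""
    board, _score = state
    n = num_simulations(len(empty_cells(board)), theta)
    results = [play_random_game_from(state) for _ in range(n)]
    results.sort(reverse=True)                     # best scores first
    keep = max(1, round(theta[5] * len(results)))  # theta_6: proportion kept
    top = results[:keep]
    return sum(top) / len(top)

def play_random_game_from(state):
    """Play random legal moves until the game ends; return the score."""
    board, score = state
    while legal_moves(board):
        board, gained = apply_move(board, random.choice(legal_moves(board)))
        score += gained
        board = spawn_tile_copy(board)  # assumed helper: adds a random tile
    return score
\end{verbatim}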
\subsection{The Problem with Bellman's Equation}
A common approach in MDP applications is to use Bellman's Equation to evaluate the utility of taking action $a$. Rather than the utility of entering state $s^\prime$ being given by $V(s^\prime)$, Bellman's equation holds that the utility is given by $R(s, a) + \gamma V(s^\prime)$. Here, $R(s, a)$ is the immediate guaranteed value associated with taking action $a$ when in state $s$, and $\gamma \in [0, 1]$ is a discount rate for expected future rewards. In principle, and in many other applications of MDPs, this approach outperforms simply using $V(s^\prime)$, because it adheres to the principle that immediate, certain rewards are more favourable than uncertain future rewards. Equation \ref{Bellman_policy} shows how Bellman's Equation could be incorporated into the policy function, $\pi (s)$.
Despite the theoretical backing, using Equation \ref{Bellman_policy} as the policy function decreased performance compared to Equation \ref{MC12_policy_fn} (tested with $\gamma = 0.9$) and brought performance roughly in line with the less computationally intense Pure Monte Carlo Tree Search. This is likely because the random games explored by the Monte Carlo value function tend to end with a score only marginally higher than the current score. As a result, the immediate score gain $R(s, a)$ has a much larger impact on the policy in Equation \ref{Bellman_policy} than future scores do. This is problematic because it makes the algorithm ``greedy,'' prioritizing immediate rewards too heavily over future rewards (even with high values of $\gamma$).
\begin{equation}
\label{Bellman_policy}
\pi(s) = \argmax\limits_{a}\left( \sum\limits_{s' \in S} P(s'|s, a)(R(s,a) + \gamma V(s', \vec{\theta})) \right)
\end{equation}
\section{Results}
%can have some sort of Bayesian BDF bc can model the proportion of time it reaches 2048 as berneulli so can do classic PDF (or just use a beta distribution).\\
\begin{table}[h]
\caption{Rate of reaching each milestone tile and average score for each model}
\centering
\begin{tabular}{|l|c|c|c|c|}
\hline
Model & 2048 & 4096 & 8192 & Avg. Score \\
\hline
Pure MCTS & 77\% & 10\% & 0\% & 31,011 \\
\hline
MDP (top 100\%) & \cellcolor{green!25}97\% & \cellcolor{green!25}58\% & 0\% & 53,232\\
\hline
MDP (top 75\%) & 94\% & \cellcolor{green!25}58\% & 1\% & 53,565\\
\hline
MDP (top 50\%) & 94\% & 57\% & 2\% & 53,282\\
\hline
MDP (top 25\%) & 93\% & \cellcolor{green!25}58\% & \cellcolor{green!25}5\% & \cellcolor{green!25}55,966 \\
\hline
\end{tabular}
\end{table}
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{MC12_and_4 violin.png}
\caption{Model Score Distribution}
\label{fig:violin1}
\end{figure}
\section{Discussion}
% this para below can go in discussion once I do more param optimization
Model score is most strongly influenced by the proportion of runs in which the 4096 tile is reached. Many trials (across a wide range of parameters) which fail to reach the 4096 tile end with a score around 36,000, shortly before the 4096 tile would have been reached. AI strategies are most likely to fail shortly before reaching the next milestone tile, because this is when the board holds the most high-value tiles which cannot yet be joined. Once the 4096 tile is reached, scores below 60,000 are uncommon, although there is a meaningful distribution between 60,000 and 80,000, where 80,000 represents the point near achieving an 8192 tile. This distribution of results requires that a given set of parameters be evaluated multiple times before performance is reported. To illustrate this, consider the following example:
Suppose that a set of parameters $\vec{\theta_1}$ has a true probability of 60\% of reaching the 4096 tile, with the model scoring 35,000 when it does not reach 4096, and 70,000 when it does. The long-term expected performance of $\vec{\theta_1}$ should therefore be reported as $56{,}000 = 0.6(70{,}000) + 0.4(35{,}000)$. If $\vec{\theta_1}$ is tested only 3 times, there is a probability of $\binom{3}{1}(0.6)(0.4)^2 \approx 0.29$ that the AI reaches 4096 exactly once and reports a score of roughly 46,667. With 9 repeats, the probability of a score that low or lower being reported (reaching 4096 three times or fewer out of 9) falls to 0.099. Running the parameters 9 times drastically reduces the noisiness of the data, which provides a much more accurate representation of parameter performance. This enables the Bayesian optimization model (discussed in Section 6.1) to make more accurate predictions about how future sets of parameters will perform.
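These probabilities follow from modelling each evaluation as an independent Bernoulli trial. A quick check of the figures above using SciPy (a sketch; SciPy is not necessarily part of the project):
\begin{verbatim}
from scipy.stats import binom

p = 0.6  # assumed probability of reaching the 4096 tile

print(binom.pmf(1, 3, p))          # exactly 1 success in 3 runs: ~0.288
print(binom.cdf(3, 9, p))          # 3 or fewer successes in 9 runs: ~0.099
print((1*70_000 + 2*35_000) / 3)   # reported score with 1 of 3: ~46,667
print(0.6*70_000 + 0.4*35_000)     # true expected score: 56,000
\end{verbatim}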
\section{Future Directions}
\subsection{Parameter Optimization}
This is an ongoing project, and the next step I hope to implement is the optimization of $\vec{\theta}$, the parameter vector which controls how many random Monte Carlo searches to perform for a given number of empty tiles. The goal is to find the optimal tradeoff between runtime and score. Below is a rough outline of the plan:
\begin{enumerate}
\item Use a Quasi-Monte Carlo sampling technique with low discrepancy (such as Latin Hypercube Sampling or Sobol sampling) to draw a series of samples within the parameter space (see the sketch following this list).
\item Repeatedly use Bayesian Optimization to determine which sets of parameters to evaluate next. (Implementation details such as the acquisition function are not yet settled.)
\end{enumerate}
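One possible starting point for the sampling step, using SciPy's \texttt{scipy.stats.qmc} module; the bounds shown are placeholders rather than chosen values.
\begin{verbatim}
from scipy.stats import qmc

# theta_1..theta_4: simulation budgets, theta_5: 2-vs-4 search ratio,
# theta_6: top proportion kept. Bounds below are placeholders.
lower = [100,  50,   20,  5,  1, 0.1]
upper = [2000, 1000, 200, 50, 20, 1.0]

sampler = qmc.LatinHypercube(d=6, seed=0)
unit_points = sampler.random(n=64)             # points in [0, 1]^6
candidates = qmc.scale(unit_points, lower, upper)
# Each row of `candidates` is a parameter vector to evaluate (with
# repeated games, per Section 5) before fitting the Bayesian
# optimization model in step 2.
\end{verbatim}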
I intend to optimize for both move speed and score, and find the Pareto efficient solution set, which lets me choose the optimal tradeoff between speed and average score for the final model. Bayesian optimization is likely a strong framework for parameter optimization because it calls the target function (which can be non-differentiable) fewer times than other methods such as evolutionary algorithms.
\noindent{\textbf{Contingency plan for parameter optimization:}\\}
If Bayesian optimization does not produce satisfactory results, I will use the parameter performance data it generates to train a neural network, which will serve as a surrogate model (justified by the universal approximation theorem). The surrogate model will serve as a cheap-to-run approximation of the performance metrics obtained by a set of parameters. I can find the minimum of the surrogate model using an evolutionary algorithm, and then evaluate those parameters on the true model. Obtaining new data this way and retraining the model would ideally converge to a global optimum after repeating the process.
\subsection{Dynamic Policy Switching}
It is possible that when there are few open tiles on the board, explicit exploration to a greater depth would be beneficial. Dynamically switching to an Expectimax algorithm seems likely to produce strong results here. Expectimax uses Equation \ref{Bellman_policy} with a recursive value function given by Equation \ref{expectimax_value_fn}.
\begin{equation}
\label{expectimax_value_fn}
V(s, \vec{\theta}) = \sum_{s^\prime \in S} P(s^\prime|s, a)(R(s, s^\prime) + \gamma V(s^\prime, \vec{\theta}))
\end{equation}
Equation \ref{expectimax_value_fn} would be evaluated to a certain recursive depth, and then Algorithm \ref{value_function} would be called as the final evaluation of $V(s, \vec{\theta})$ to estimate the value at the final search state. This strategy would be particularly useful when there are few open tiles on the board, because the low number of states to search increases the maximum search depth attainable in a given amount of time.
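A sketch of this switch in Python: recurse on the expectimax value to a fixed depth, then fall back to the Monte Carlo estimate of Algorithm \ref{value_function} at the leaves. The helpers are the same assumed names as in the earlier sketches, and the depth cutoff and $\gamma$ are illustrative.
\begin{verbatim}
def expectimax_value(board, theta, depth, gamma=0.9):
    """Depth-limited expectimax; leaves use the Monte Carlo value."""
    if depth == 0 or not legal_moves(board):
        return value((board, 0), theta)
    best = float("-inf")
    for action in legal_moves(board):
        after, reward = apply_move(board, action)
        empties = empty_cells(after)
        m = len(empties)
        expected = 0.0
        for cell in empties:
            for tile, prob in ((2, 0.9), (4, 0.1)):
                child = place_tile(after, cell, tile)
                expected += (prob / m) * (
                    reward + gamma * expectimax_value(child, theta,
                                                      depth - 1, gamma))
        best = max(best, expected)
    return best
\end{verbatim}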
%The primary area for improvement is parameter optimization. In particular, the $\vec{\theta}$ parameter described in section 3.2 could be optimized to find... NOT DONE
%MAIN IDEAS FOR FUTURE DIRECTIONS:\\
%- optimize $\vec{\theta}$ using surrogate modelling:
%\begin{enumerate}
% \item Evaluate model with a series of parameters created by latin hypercube sampling
% \item Design cost function which should be minimized to have "optimal performance". Will need to trade off speed and performance. probably ideally will have on average 2 seconds per move and reward higher scores.
% \item EITHER Train some sort of machine learning model to approximate the underlying function, and then minimize that function's cost
% \item OR use Bayesian optimization so I can get a better sense of uncertainty.
% \item ideally make it multi-objective so I can find the pareto optimal frontier and decide on my own optimal tradeoff of speed vs strength.
%\end{enumerate}
%Use heuristics\\
%Note that part of the challenge here was not to use heuristics
%https://arxiv.org/abs/2110.10374
%These stanford math profs made a deep reinforcement learning model that doesnt use heuristics and mine is better than it I think (theirs seems to be for a class they taught rather than research really). Check performance to be sure.
%dynamically switch to minimax or expectimax when there are few open tiles.\\
\end{document}