# List of Algorithms

| \# | Caption | Page |
| :--- | :--- | ---: |
| Algorithm 2.1 | Check whether the policy is optimal. | 49 |
| Algorithm 2.2 | Policy improvement. | 50 |
| Algorithm 3.1 | Model-based numerical iterative policy evaluation to estimate state values. | 88 |
| Algorithm 3.2 | Model-based numerical iterative policy evaluation to estimate action values. | 89 |
| Algorithm 3.3 | Model-based numerical iterative policy evaluation to estimate action values (space-saving version). | 90 |
| Algorithm 3.4 | Model-based numerical iterative policy evaluation (space-saving version, alternative implementation). | 90 |
| Algorithm 3.5 | Model-based policy iteration. | 92 |
| Algorithm 3.6 | Model-based policy iteration (space-saving version). | 92 |
| Algorithm 3.7 | Model-based VI. | 93 |
| Algorithm 3.8 | Model-based VI (space-saving version). | 93 |
| Algorithm 4.1 | Evaluate action values using every-visit MC policy evaluation. | 110 |
| Algorithm 4.2 | Every-visit MC update to evaluate state values. | 110 |
| Algorithm 4.3 | First-visit MC update to estimate action values. | 111 |
| Algorithm 4.4 | First-visit MC update to estimate state values. | 112 |
| Algorithm 4.5 | MC update with exploring start (maintaining policy explicitly). | 113 |
| Algorithm 4.6 | MC update with exploring start (maintaining policy implicitly). | 114 |
| Algorithm 4.7 | MC update with soft policy (maintaining policy explicitly). | 116 |
| Algorithm 4.8 | MC update with soft policy (maintaining policy implicitly). | 117 |
| Algorithm 4.9 | Evaluate action values using off-policy MC update based on importance sampling. | 121 |
| Algorithm 4.10 | Find an optimal policy using off-policy MC update based on importance sampling. | 123 |
| Algorithm 5.1 | One-step TD policy evaluation to estimate action values. | 139 |
| Algorithm 5.2 | One-step TD policy evaluation to estimate action values with an indicator of episode end. | 140 |
| Algorithm 5.3 | One-step TD policy evaluation to estimate state values. | 141 |
| Algorithm 5.4 | $n$-step TD policy evaluation to estimate action values. | 142 |
| Algorithm 5.5 | $n$-step TD policy evaluation to estimate state values. | 143 |
| Algorithm 5.6 | SARSA (maintaining the policy explicitly). | 144 |
| Algorithm 5.7 | SARSA (maintaining the policy implicitly). | 145 |
| Algorithm 5.8 | $n$-step SARSA. | 146 |
| Algorithm 5.9 | Expected SARSA. | 147 |
| Algorithm 5.10 | $n$-step expected SARSA. | 148 |
| Algorithm 5.11 | $n$-step TD policy evaluation of SARSA with importance sampling. | 150 |
| Algorithm 5.12 | Q learning. | 152 |
| Algorithm 5.13 | Double Q learning. | 154 |
| Algorithm 5.14 | TD($\lambda$) policy evaluation or SARSA($\lambda$). | 158 |
| Algorithm 5.15 | TD($\lambda$) policy evaluation to estimate state values. | 159 |
| Algorithm 6.1 | Policy evaluation with function approximation and SGD. | 177 |
| Algorithm 6.2 | Policy optimization with function approximation and SGD. | 177 |
| Algorithm 6.3 | Semi-gradient descent policy evaluation to estimate action values or SARSA policy optimization. | 178 |
| Algorithm 6.4 | Semi-gradient descent policy evaluation to estimate state values, or expected SARSA policy optimization, or Q learning. | 179 |
| Algorithm 6.5 | TD($\lambda$) policy evaluation for action values or SARSA. | 181 |
| Algorithm 6.6 | TD($\lambda$) policy evaluation for state values, or expected SARSA, or Q learning. | 181 |
| Algorithm 6.7 | DQN policy optimization with experience replay (loop over episodes). | 187 |
| Algorithm 6.8 | DQN policy optimization with experience replay (without looping over episodes explicitly). | 188 |
| Algorithm 6.9 | DQN with experience replay and target network. | 191 |
| Algorithm 7.1 | VPG policy optimization. | 220 |
| Algorithm 7.2 | VPG policy optimization with baseline. | 222 |
| Algorithm 7.3 | Importance sampling PG policy optimization. | 223 |
| Algorithm 8.1 | Action-value on-policy AC. | 239 |
| Algorithm 8.2 | Advantage AC. | 240 |
| Algorithm 8.3 | A3C (one-step TD version, showing the behavior of one worker). | 240 |
| Algorithm 8.4 | Advantage AC with eligibility trace. | 242 |
| Algorithm 8.5 | Clipped PPO (simplified version). | 246 |
| Algorithm 8.6 | Clipped PPO (with on-policy experience replay). | 246 |
| Algorithm 8.7 | Vanilla NPG. | 253 |
| Algorithm 8.8 | CG. | 255 |
| Algorithm 8.9 | NPG with CG. | 255 |
| Algorithm 8.10 | TRPO. | 257 |
| Algorithm 8.11 | OffPAC. | 258 |
| Algorithm 9.1 | Vanilla on-policy deterministic AC. | 292 |
| Algorithm 9.2 | OPDAC. | 294 |
| Algorithm 9.3 | DDPG. | 295 |
| Algorithm 9.4 | TD3. | 297 |
| Algorithm 10.1 | SQL. | 326 |
| Algorithm 10.2 | SAC. | 328 |
| Algorithm 10.3 | SAC with automatic entropy adjustment. | 331 |
| Algorithm 11.1 | ES. | 356 |
| Algorithm 11.2 | ARS. | 358 |
| Algorithm 12.1 | Categorical DQN to find the optimal policy (to maximize expectation). | 377 |
| Algorithm 12.2 | Categorical DQN to find the optimal policy (to maximize VNM utility). | 378 |
| Algorithm 12.3 | QR-DQN to find the optimal policy (to maximize expectation). | 382 |
| Algorithm 12.4 | IQN to find the optimal policy (to maximize expectation). | 384 |
| Algorithm 12.5 | Categorical DQN to find the optimal policy (using the Yaari distortion function). | 387 |
| Algorithm 13.1 | $\varepsilon$-greedy. | 414 |
| Algorithm 13.2 | UCB (including UCB1). | 415 |
| Algorithm 13.3 | Bayesian UCB. | 421 |
| Algorithm 13.4 | Thompson sampling. | 422 |
| Algorithm 13.5 | UCBVI. | 423 |
| Algorithm 14.1 | MCTS. | 435 |
| Algorithm 14.2 | AlphaZero. | 441 |
| Algorithm 14.3 | MuZero. | 442 |
| Algorithm 15.1 | Semi-gradient descent policy evaluation to estimate action values or SARSA policy optimization. | 484 |
| Algorithm 15.2 | Semi-gradient descent differential expected SARSA policy optimization, or differential Q learning. | 485 |
| Algorithm 15.3 | Model-based VI for fixed-horizon episodes. | 492 |
| Algorithm 16.1 | GAIL-PPO. | 543 |
# List of Figures

| \# | Caption | Page |
| :--- | :--- | ---: |
| Figure 1.1 | Robot in a maze. | 2 |
| Figure 1.2 | PacMan in Atari 2600. | 4 |
| Figure 1.3 | A record for a game of Go. | 4 |
| Figure 1.4 | Bipedal walker. | 5 |
| Figure 1.5 | Large language models. | 5 |
| Figure 1.6 | Agent–environment interface. | 5 |
| Figure 1.7 | Taxonomy of RL. | 8 |
| Figure 1.8 | Relationship among RL, DL, and DRL. | 11 |
| Figure 2.1 | State transition graph of the example. | 26 |
| Figure 2.2 | Compare trajectories of DTMP, DTMRP, and DTMDP. | 28 |
| Figure 2.3 | State transition graph of the example "Feed and Full". | 29 |
| Figure 2.4 | Backup diagram for state values and action values representing each other. | 40 |
| Figure 2.5 | State values and action values back up themselves. | 42 |
| Figure 2.6 | Backup diagram for optimal state values and optimal action values backing up each other. | 64 |
| Figure 2.7 | Backup diagram for optimal state values and optimal action values backing up themselves. | 65 |
| Figure 2.8 | Grid of the task `CliffWalking-v0`. | 72 |
| Figure 3.1 | Policy improvement. | 91 |
| Figure 3.2 | Illustration of bootstrap. | 95 |
| Figure 4.1 | An example task of Monte Carlo. | 106 |
| Figure 4.2 | An example where the optimal policy may not be found without exploring start. | 113 |
| Figure 4.3 | State value estimates obtained by the policy evaluation algorithm. | 128 |
| Figure 4.4 | Optimal policy estimates. | 129 |
| Figure 4.5 | Optimal state value estimates. | 130 |
| Figure 5.1 | Backup diagram of TD return and MC return. | 138 |
| Figure 5.2 | Maximization bias in Q learning. | 153 |
| Figure 5.3 | Backup diagram of $\lambda$ return. | 156 |
| Figure 5.4 | Compare different eligibility traces. | 158 |
| Figure 5.5 | ASCII map of the task `Taxi-v3`. | 160 |
| Figure 6.1 | MDP in Baird's counterexample. | 184 |
| Figure 6.2 | Trend of parameters with iterations. | 185 |
| Figure 6.3 | The task `MountainCar-v0`. | 195 |
| Figure 6.4 | Position and velocity of the car when it is always pushed right. | 196 |
| Figure 6.5 | One-hot coding and tile coding. | 197 |
| Figure 7.1 | The cart-pole problem. | 224 |
| Figure 8.1 | Illustration of the MM algorithm. | 244 |
| Figure 8.2 | Relationship among $g_{\pi\left({\mathbf\uptheta}\right)}$, $l\left({\mathbf\uptheta}\middle\vert{\mathbf\uptheta_k}\right)$, and $l_c\left({\mathbf\uptheta}\middle\vert{\mathbf\uptheta_k}\right)$. | 252 |
| Figure 8.3 | The task `Acrobot-v1`. | 259 |
| Figure 9.1 | The task `Pendulum-v1`. | 300 |
| Figure 12.1 | Some Atari games. | 390 |
| Figure 12.2 | Neural network for Categorical DQN. | 398 |
| Figure 12.3 | Neural network for IQN. | 403 |
| Figure 14.1 | Search tree. | 434 |
| Figure 14.2 | Steps of MCTS. | 436 |
| Figure 14.3 | First two steps of the reversi opening "Chimney". | 446 |
| Figure 14.4 | Game tree of Tic-Tac-Toe. | 448 |
| Figure 14.5 | Maximin decision of Tic-Tac-Toe. | 449 |
| Figure 14.6 | MCTS with self-play. | 450 |
| Figure 14.7 | Reverse the color of all pieces on the board. | 452 |
| Figure 14.9 | Residual network. | 452 |
| Figure 14.8 | Example structure of the prediction network for the game of Go. | 453 |
| Figure 15.1 | MDP of the task "Tiger". | 501 |
| Figure 15.2 | Trajectories maintained by the environment and the agent. | 503 |
| Figure 15.3 | Belief MDP of the task "Tiger". | 507 |
| Figure 16.1 | Learning from feedback. | 526 |
| Figure 16.2 | Agent–environment interface of IL. | 531 |
| Figure 16.3 | Compounding error of imitation policy. | 541 |
| Figure 16.4 | Training GPT. | 545 |
| Figure 16.5 | Principal axes and Euler's angles. | 548 |