diff --git a/README.md b/README.md
index 940a392..4fd5479 100644
--- a/README.md
+++ b/README.md
@@ -21,6 +21,9 @@ This is a tutorial book on reinforcement learning, with explanation of theory an
 
 Check [here](https://github.com/ZhiqingXiao/rl-book/tree/master/en2024) for codes, exercise answers, etc.
 
+Check [SpringerLink](https://doi.org/10.1007/978-981-19-4933-3) and [Amazon](https://www.amazon.com/dp/9811949328) for the book contents.
+
+
 ### Table of Codes
 
 All codes have been saved as a .ipynb file and a .html file in the same directory.
@@ -46,7 +49,7 @@ All codes have been saved as a .ipynb file and a .html file in the same director
 | 16 | [HumanoidBulletEnv-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_ClosedForm_demo.html) | BehaviorClone [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_BC_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_BC_torch.html), GAIL [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_GAILPPO_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_GAILPPO_torch.html) |
 
-# 强化学习:原理与Python实战
+# 强化学习:原理与Python实战 (2023 中文版)
 
 **全球第一本配套 TensorFlow 2 和 PyTorch 1/2 对照代码的强化学习教程书**
@@ -89,7 +92,7 @@ All codes have been saved as a .ipynb file and a .html file in the same director
 本书介绍强化学习理论及其 Python 实现。
 
 - 理论完备:全书用一套完整的数学体系,严谨地讲授强化学习的理论基础,主要定理均给出证明过程。各章内容循序渐进,覆盖了所有主流强化学习算法,包括资格迹等非深度强化学习算法和柔性执行者/评论者等深度强化学习算法。
-- 案例丰富:在您最爱的操作系统(包括 Windows、macOS、Linux)上,基于 Python 3、Gym 0.26 和 TensorFlow 2 + PyTorch 1/2,实现强化学习算法。全书实现统一规范,体积小、重量轻。第 1~9 章给出了算法的配套实现,环境部分只依赖于 Gym 的最小安装,在没有 GPU 的计算机上也可运行;第 10~12 章介绍了多个热门综合案例,涵盖 Gym 的完整安装和自定义扩展,在有普通 GPU 的计算机上即可运行。
+- 案例丰富:在您最爱的操作系统(包括 Windows、macOS、Linux)上,基于 Python 3、Gym 0.26 和 TensorFlow 2,实现强化学习算法。全书实现统一规范,体积小、重量轻。第 1~9 章给出了算法的配套实现,环境部分只依赖于 Gym 的最小安装,在没有 GPU 的计算机上也可运行;第 10~12 章介绍了多个热门综合案例,涵盖 Gym 的完整安装和自定义扩展,在有普通 GPU 的计算机上即可运行。
 
 **QQ群**

diff --git a/en2024/README.md b/en2024/README.md
index a5ec59e..26bf0e2 100644
--- a/en2024/README.md
+++ b/en2024/README.md
@@ -1,6 +1,8 @@
 # Reinforcement Learning: Theory and Python Implementation
 
-**The First Reinforcement Learning Tutorial Book with one-on-one mapping TensorFlow 2 and PyTorch 1/2 Implementation**
+**The First Reinforcement Learning Tutorial Book in English with One-on-One Mapped TensorFlow 2 and PyTorch 1/2 Implementations**
+
+**Covers RL algorithms for large models such as PPO, RLHF, IRL, and PbRL**
 
 Please email me if you are interested in publishing this book in other languages.
@@ -8,7 +10,7 @@ Please email me if you are interested in publishing this book in other languages
 
 This book comprehensively introduces the mainstream RL theory.
 
-- This book introduces the trunk of the modern RL theory in a systematically way. All major results are accompanied with proofs. We introduce the algorithms based on the theory, which covers all mainstream RL algorithms, including both classical RL algorithms such as eligibility trace and deep RL algorithm such as MuZero.
+- This book introduces the trunk of modern RL theory in a systematic way. All major results are accompanied by proofs. We introduce the algorithms based on the theory, covering all mainstream RL algorithms, including the algorithms of the large-model era such as PPO, RLHF, IRL, and PbRL.
 - This book uses a consistent set of mathematical notations, which are compatible with mainstream RL tutorials.
 
 All chapters are accompanied with Python codes.
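The "one-on-one mapped" implementations above refer to the paired `*_tf` / `*_torch` notebooks listed in the table of codes. A minimal sketch of what such a pairing looks like (illustrative only; the network sizes and function names below are placeholders for this example, not code taken from the book's notebooks):

```python
# Illustrative sketch only -- not code from the book's notebooks.
# It shows what "one-on-one mapped" means in practice: the same tiny
# Q-network written once with tf.keras and once with torch.nn.
# The sizes (4 observations, 2 actions, 64 hidden units) and the
# function names are placeholders chosen for this example.
import tensorflow as tf
import torch

def build_qnet_tf(obs_dim=4, act_dim=2, hidden=64):
    # TensorFlow 2 version of the network.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(hidden, activation="relu"),
        tf.keras.layers.Dense(act_dim),
    ])

def build_qnet_torch(obs_dim=4, act_dim=2, hidden=64):
    # PyTorch version, mirroring the TensorFlow network layer by layer.
    return torch.nn.Sequential(
        torch.nn.Linear(obs_dim, hidden),
        torch.nn.ReLU(),
        torch.nn.Linear(hidden, act_dim),
    )
```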
@@ -22,22 +24,22 @@ All chapters are accompanied with Python codes.
 
 ### Table of Contents
 
-01. Introduction of Reinforcement Learning
-02. Markov Decision Process
-03. Model-based Numeric Iteration
-04. MC: Monte-Carlo Learning
-05. TD: Temporal Difference Learning
-06. Function Approximation
-07. PG: Policy Gradient
-08. AC: Actor-Critic
-09. DPG: Deterministic Policy Gradient
-10. Maximum-Entropy RL
-11. Policy-based Gradient-Free Algorithms
-12. Distributional RL
-13. Minimize Regret
-14. Tree Search
-15. More Agent-Environment Interface
-16. Learning from Feedback and Imitation Learning
+01. Introduction of Reinforcement Learning [view](https://doi.org/10.1007/978-981-19-4933-3_1)
+02. Markov Decision Process [view](https://doi.org/10.1007/978-981-19-4933-3_2)
+03. Model-based Numeric Iteration [view](https://doi.org/10.1007/978-981-19-4933-3_3)
+04. MC: Monte-Carlo Learning [view](https://doi.org/10.1007/978-981-19-4933-3_4)
+05. TD: Temporal Difference Learning [view](https://doi.org/10.1007/978-981-19-4933-3_5)
+06. Function Approximation [view](https://doi.org/10.1007/978-981-19-4933-3_6)
+07. PG: Policy Gradient [view](https://doi.org/10.1007/978-981-19-4933-3_7)
+08. AC: Actor-Critic [view](https://doi.org/10.1007/978-981-19-4933-3_8)
+09. DPG: Deterministic Policy Gradient [view](https://doi.org/10.1007/978-981-19-4933-3_9)
+10. Maximum-Entropy RL [view](https://doi.org/10.1007/978-981-19-4933-3_10)
+11. Policy-based Gradient-Free Algorithms [view](https://doi.org/10.1007/978-981-19-4933-3_11)
+12. Distributional RL [view](https://doi.org/10.1007/978-981-19-4933-3_12)
+13. Minimize Regret [view](https://doi.org/10.1007/978-981-19-4933-3_13)
+14. Tree Search [view](https://doi.org/10.1007/978-981-19-4933-3_14)
+15. More Agent-Environment Interface [view](https://doi.org/10.1007/978-981-19-4933-3_15)
+16. Learning from Feedback and Imitation Learning [view](https://doi.org/10.1007/978-981-19-4933-3_16)
 
 ### Resources
 
@@ -45,6 +47,10 @@ All chapters are accompanied with Python codes.
 - Guide to set up developing environment: [Windows](https://github.com/ZhiqingXiao/rl-book/blob/master/en2024/setup/setupwin.md) [macOS](https://github.com/ZhiqingXiao/rl-book/blob/master/en2024/setup/setupmac.md)
 - Table of notations: [link](https://github.com/ZhiqingXiao/rl-book/blob/master/en2024/notation.md)
 - Table of abbreviations: [link](https://github.com/zhiqingxiao/rl-book/blob/master/en2024/abbreviation.md)
+- Table of algorithms: [link](https://github.com/zhiqingxiao/rl-book/blob/master/en2024/algo.md)
+- Table of interdisciplinary references: [link](https://github.com/zhiqingxiao/rl-book/blob/master/en2024/iref.md)
+- Table of figures: [link](https://github.com/zhiqingxiao/rl-book/blob/master/en2024/figure.md)
+- Table of tables: [link](https://github.com/zhiqingxiao/rl-book/blob/master/en2024/table.md)
 - Gym Internal: [link](https://github.com/ZhiqingXiao/rl-book/blob/master/en2024/gym.md)
 - Bibliography: [link](https://github.com/zhiqingxiao/rl-book/blob/master/en2024/bibliography.md)
@@ -83,17 +89,20 @@ Note:
     title = {Reinforcement Learning: Theory and {Python} Implementation},
     author = {Zhiqing Xiao}
     year = 2024,
-    publisher = {Springer Nature},
+    month = 9,
+    publisher = {Springer},
 }
 
 # 强化学习:原理与Python实现(英文版)
 
-**全球第一本配套 TensorFlow 2 和 PyTorch 1/2 一比一对照代码的强化学习书**
+**全球第一本配套 TensorFlow 2 和 PyTorch 1/2 一比一对照代码的英文强化学习书**
+
+**解密大模型训练技术 PPO、RLHF、IRL和PbRL**
 
 ### 本书特色
 
 本书完整地介绍了主流强化学习理论。
 
-- 选用现代强化学习理论体系,突出主干,主要定理均给出证明过程。基于理论讲解强化学习算法,全面覆盖主流强化学习算法,包括了资格迹等经典算法和MuZero等深度强化学习算法。
+- 选用现代强化学习理论体系,突出主干,主要定理均给出证明过程。基于理论讲解强化学习算法,全面覆盖主流强化学习算法,包括了资格迹等经典算法和MuZero等深度强化学习算法。涵盖大模型时代的常用算法如PPO、RLHF、IRL、PbRL等。
 - 全书采用完整的数学体系,各章内容循序渐进。全书采用一致的数学符号,并兼容主流强化学习教程。
 本书各章均提供Python代码,实战性强。

diff --git a/en2024/algo.md b/en2024/algo.md
new file mode 100644
index 0000000..ab377c6
--- /dev/null
+++ b/en2024/algo.md
@@ -0,0 +1,88 @@
+# List of Algorithms
+
+| \# | Caption | Page |
+| :--- | :--- | ---: |
+| Algorithm 2.1 | Check whether the policy is optimal. | 49 |
+| Algorithm 2.2 | Policy improvement. | 50 |
+| Algorithm 3.1 | Model-based numerical iterative policy evaluation to estimate state values. | 88 |
+| Algorithm 3.2 | Model-based numerical iterative policy evaluation to estimate action values. | 89 |
+| Algorithm 3.3 | Model-based numerical iterative policy evaluation to estimate action values (space-saving version). | 90 |
+| Algorithm 3.4 | Model-based numerical iterative policy evaluation (space-saving version, alternative implementation). | 90 |
+| Algorithm 3.5 | Model-based policy iteration. | 92 |
+| Algorithm 3.6 | Model-based policy iteration (space-saving version). | 92 |
+| Algorithm 3.7 | Model-based VI. | 93 |
+| Algorithm 3.8 | Model-based VI (space-saving version). | 93 |
+| Algorithm 4.1 | Evaluate action values using every-visit MC policy evaluation. | 110 |
+| Algorithm 4.2 | Every-visit MC update to evaluate state values. | 110 |
+| Algorithm 4.3 | First-visit MC update to estimate action values. | 111 |
+| Algorithm 4.4 | First-visit MC update to estimate state values. | 112 |
+| Algorithm 4.5 | MC update with exploring start (maintaining policy explicitly). | 113 |
+| Algorithm 4.6 | MC update with exploring start (maintaining policy implicitly). | 114 |
+| Algorithm 4.7 | MC update with soft policy (maintaining policy explicitly). | 116 |
+| Algorithm 4.8 | MC update with soft policy (maintaining policy implicitly). | 117 |
+| Algorithm 4.9 | Evaluate action values using off-policy MC update based on importance sampling. | 121 |
+| Algorithm 4.10 | Find an optimal policy using off-policy MC update based on importance sampling. | 123 |
+| Algorithm 5.1 | One-step TD policy evaluation to estimate action values. | 139 |
+| Algorithm 5.2 | One-step TD policy evaluation to estimate action values with an indicator of episode end. | 140 |
+| Algorithm 5.3 | One-step TD policy evaluation to estimate state values. | 141 |
+| Algorithm 5.4 | $n$-step TD policy evaluation to estimate action values. | 142 |
+| Algorithm 5.5 | $n$-step TD policy evaluation to estimate state values. | 143 |
+| Algorithm 5.6 | SARSA (maintaining the policy explicitly). | 144 |
+| Algorithm 5.7 | SARSA (maintaining the policy implicitly). | 145 |
+| Algorithm 5.8 | $n$-step SARSA. | 146 |
+| Algorithm 5.9 | Expected SARSA. | 147 |
+| Algorithm 5.10 | $n$-step expected SARSA. | 148 |
+| Algorithm 5.11 | $n$-step TD policy evaluation of SARSA with importance sampling. | 150 |
+| Algorithm 5.12 | Q learning. | 152 |
+| Algorithm 5.13 | Double Q learning. | 154 |
+| Algorithm 5.14 | TD ${\left ({\lambda }\right )}$ policy evaluation or SARSA ${\left ({\lambda }\right )}$. | 158 |
+| Algorithm 5.15 | TD ${\left ({\lambda }\right )}$ policy evaluation to estimate state values. | 159 |
+| Algorithm 6.1 | Policy evaluation with function approximation and SGD. | 177 |
+| Algorithm 6.2 | Policy optimization with function approximation and SGD. | 177 |
+| Algorithm 6.3 | Semi-gradient descent policy evaluation to estimate action values or SARSA policy optimization. | 178 |
+| Algorithm 6.4 | Semi-gradient descent policy evaluation to estimate state values, or expected SARSA policy optimization, or Q learning. | 179 |
+| Algorithm 6.5 | TD ${\left ({\lambda }\right )}$ policy evaluation for action values or SARSA. | 181 |
+| Algorithm 6.6 | TD ${\left ({\lambda }\right )}$ policy evaluation for state values, or expected SARSA, or Q learning. | 181 |
+| Algorithm 6.7 | DQN policy optimization with experience replay (loop over episodes). | 187 |
+| Algorithm 6.8 | DQN policy optimization with experience replay (without looping over episodes explicitly). | 188 |
+| Algorithm 6.9 | DQN with experience replay and target network. | 191 |
+| Algorithm 7.1 | VPG policy optimization. | 220 |
+| Algorithm 7.2 | VPG policy optimization with baseline. | 222 |
+| Algorithm 7.3 | Importance sampling PG policy optimization. | 223 |
+| Algorithm 8.1 | Action-value on-policy AC. | 239 |
+| Algorithm 8.2 | Advantage AC. | 240 |
+| Algorithm 8.3 | A3C (one-step TD version, showing the behavior of one worker). | 240 |
+| Algorithm 8.4 | Advantage AC with eligibility trace. | 242 |
+| Algorithm 8.5 | Clipped PPO (simplified version). | 246 |
+| Algorithm 8.6 | Clipped PPO (with on-policy experience replay). | 246 |
+| Algorithm 8.7 | Vanilla NPG. | 253 |
+| Algorithm 8.8 | CG. | 255 |
+| Algorithm 8.9 | NPG with CG. | 255 |
+| Algorithm 8.10 | TRPO. | 257 |
+| Algorithm 8.11 | OffPAC. | 258 |
+| Algorithm 9.1 | Vanilla on-policy deterministic AC. | 292 |
+| Algorithm 9.2 | OPDAC. | 294 |
+| Algorithm 9.3 | DDPG. | 295 |
+| Algorithm 9.4 | TD3. | 297 |
+| Algorithm 10.1 | SQL. | 326 |
+| Algorithm 10.2 | SAC. | 328 |
+| Algorithm 10.3 | SAC with automatic entropy adjustment. | 331 |
+| Algorithm 11.1 | ES. | 356 |
+| Algorithm 11.2 | ARS. | 358 |
+| Algorithm 12.1 | Categorical DQN to find the optimal policy (to maximize expectation). | 377 |
+| Algorithm 12.2 | Categorical DQN to find the optimal policy (to maximize VNM utility). | 378 |
+| Algorithm 12.3 | QR-DQN to find the optimal policy (to maximize expectation). | 382 |
+| Algorithm 12.4 | IQN to find the optimal policy (to maximize expectation). | 384 |
+| Algorithm 12.5 | Categorical DQN to find the optimal policy (use Yaari distortion function). | 387 |
+| Algorithm 13.1 | $\varepsilon $-greedy. | 414 |
+| Algorithm 13.2 | UCB (including UCB1). | 415 |
+| Algorithm 13.3 | Bayesian UCB. | 421 |
+| Algorithm 13.4 | Thompson Sampling. | 422 |
+| Algorithm 13.5 | UCBVI. | 423 |
+| Algorithm 14.1 | MCTS. | 435 |
+| Algorithm 14.2 | AlphaZero. | 441 |
+| Algorithm 14.3 | MuZero. | 442 |
+| Algorithm 15.1 | Semi-gradient descent policy evaluation to estimate action values or SARSA policy optimization. | 484 |
+| Algorithm 15.2 | Semi-gradient descent differential expected SARSA policy optimization, or differential Q learning. | 485 |
+| Algorithm 15.3 | Model-based VI for fixed-horizon episodes. | 492 |
+| Algorithm 16.1 | GAIL-PPO. | 543 |

diff --git a/en2024/cover.jpg b/en2024/cover.jpg
index 37b775c..3dc856d 100644
Binary files a/en2024/cover.jpg and b/en2024/cover.jpg differ

diff --git a/en2024/figure.md b/en2024/figure.md
new file mode 100644
index 0000000..6a7bdf2
--- /dev/null
+++ b/en2024/figure.md
@@ -0,0 +1,62 @@
+# List of Figures
+
+| \# | Caption | Page |
+| :--- | :--- | ---: |
+| Figure 1.1 | Robot in a maze. | 2 |
+| Figure 1.2 | PacMan in Atari 2600. | 4 |
+| Figure 1.3 | A record for a game of Go. | 4 |
+| Figure 1.4 | Bipedal walker. | 5 |
+| Figure 1.5 | Large language models. | 5 |
+| Figure 1.6 | Agent–environment interface. | 5 |
+| Figure 1.7 | Taxonomy of RL. | 8 |
+| Figure 1.8 | Relationship among RL, DL, and DRL. | 11 |
+| Figure 2.1 | State transition graph of the example. | 26 |
+| Figure 2.2 | Compare trajectories of DTMP, DTMRP, and DTMDP. | 28 |
+| Figure 2.3 | State transition graph of the example "Feed and Full". | 29 |
+| Figure 2.4 | Backup diagram that state values and action values represent each other. | 40 |
+| Figure 2.5 | State values and action values back up themselves. | 42 |
+| Figure 2.6 | Backup diagram for optimal state values and optimal action values backing up each other. | 64 |
+| Figure 2.7 | Backup diagram for optimal state values and optimal action values backing up themselves. | 65 |
+| Figure 2.8 | Grid of the task `CliffWalking-v0`. | 72 |
+| Figure 3.1 | Policy improvement. | 91 |
+| Figure 3.2 | Illustration of bootstrap. | 95 |
+| Figure 4.1 | An example task of Monte Carlo. | 106 |
+| Figure 4.2 | An example where the optimal policy may not be found without exploring start. | 113 |
+| Figure 4.3 | State value estimates obtained by policy evaluation algorithm. | 128 |
+| Figure 4.4 | Optimal policy estimates. | 129 |
+| Figure 4.5 | Optimal state value estimates. | 130 |
+| Figure 5.1 | Backup diagram of TD return and MC return. | 138 |
+| Figure 5.2 | Maximization bias in Q learning. | 153 |
+| Figure 5.3 | Backup diagram of $\lambda$ return. | 156 |
+| Figure 5.4 | Compare different eligibility traces. | 158 |
+| Figure 5.5 | ASCII map of the task `Taxi-v3`. | 160 |
+| Figure 6.1 | MDP in Baird's counterexample. | 184 |
+| Figure 6.2 | Trend of parameters with iterations. | 185 |
+| Figure 6.3 | The task `MountainCar-v0`. | 195 |
+| Figure 6.4 | Position and velocity of the car when it is always pushed right. | 196 |
+| Figure 6.5 | One-hot coding and tile coding. | 197 |
+| Figure 7.1 | The cart-pole problem. | 224 |
+| Figure 8.1 | Illustration of MM algorithm. | 244 |
+| Figure 8.2 | Relationship among $g_{\pi\left({\mathbf\uptheta}\right)}$, $l\left({\mathbf\uptheta}\middle\vert{\mathbf\uptheta_k}\right)$, and $l_c\left({\mathbf\uptheta}\middle\vert{\mathbf\uptheta_k}\right)$. | 252 |
+| Figure 8.3 | The task `Acrobot-v1`. | 259 |
+| Figure 9.1 | The task `Pendulum-v1`. | 300 |
+| Figure 12.1 | Some Atari games. | 390 |
+| Figure 12.2 | Neural network for Categorical DQN. | 398 |
+| Figure 12.3 | Neural network for IQN. | 403 |
+| Figure 14.1 | Search tree. | 434 |
+| Figure 14.2 | Steps of MCTS. | 436 |
+| Figure 14.3 | First two steps of the reversi opening "Chimney". | 446 |
+| Figure 14.4 | Game tree of Tic-Tac-Toe. | 448 |
+| Figure 14.5 | Maximin decision of Tic-Tac-Toe. | 449 |
+| Figure 14.6 | MCTS with self-play. | 450 |
+| Figure 14.7 | Reverse the color of all pieces on the board. | 452 |
+| Figure 14.9 | Residual network. | 452 |
+| Figure 14.8 | Example structure prediction network for the game of Go. | 453 |
+| Figure 15.1 | MDP of the task "Tiger". | 501 |
+| Figure 15.2 | Trajectories maintained by the environment and the agent. | 503 |
+| Figure 15.3 | Belief MDP of the task "Tiger". | 507 |
+| Figure 16.1 | Learning from feedback. | 526 |
+| Figure 16.2 | Agent–environment interface of IL. | 531 |
+| Figure 16.3 | Compounding error of imitation policy. | 541 |
+| Figure 16.4 | Training GPT. | 545 |
+| Figure 16.5 | Principal axes and Euler's angles. | 548 |

diff --git a/en2024/iref.md b/en2024/iref.md
new file mode 100644
index 0000000..eb62c2d
--- /dev/null
+++ b/en2024/iref.md
@@ -0,0 +1,52 @@
+# List of Interdisciplinary References
+
+| \# | Caption | Page |
+| :--- | :--- | ---: |
+| Interdisciplinary Reference 1.1 | Behavior Psychology: Reinforcement Learning | 2 |
+| Interdisciplinary Reference 2.1 | Stochastic Process: Markov Process | 25 |
+| Interdisciplinary Reference 2.2 | Optimization: Duality in Linear Programming | 69 |
+| Interdisciplinary Reference 3.1 | Functional Analysis: Metric and its Completeness | 82 |
+| Interdisciplinary Reference 3.2 | Functional Analysis: Contraction Mapping | 83 |
+| Interdisciplinary Reference 3.3 | Functional Analysis: Fixed Point | 86 |
+| Interdisciplinary Reference 3.4 | Functional Analysis: Banach Fixed Point Theorem | 86 |
+| Interdisciplinary Reference 3.5 | Statistics: Bootstrap | 94 |
+| Interdisciplinary Reference 3.6 | Algorithm: Dynamic Programming | 95 |
+| Interdisciplinary Reference 4.1 | Statistics: Monte Carlo Method | 106 |
+| Interdisciplinary Reference 4.2 | Stochastic Approximation: Robbins–Monro Algorithm | 107 |
+| Interdisciplinary Reference 4.3 | Statistics: Importance Sampling | 118 |
+| Interdisciplinary Reference 6.1 | Machine Learning: Parametric Model and Nonparametric Model | 172 |
+| Interdisciplinary Reference 6.2 | Stochastic Optimization: Stochastic Gradient Descent | 176 |
+| Interdisciplinary Reference 6.3 | Data Structure: Sum Tree and Binary Indexed Tree | 190 |
+| Interdisciplinary Reference 6.4 | Feature Engineering: One-Hot Coding and Tile Coding | 197 |
+| Interdisciplinary Reference 8.1 | Optimization: MM Algorithm | 244 |
+| Interdisciplinary Reference 8.2 | Optimization: Trust Region Method | 247 |
+| Interdisciplinary Reference 8.3 | Information Theory: Kullback–Leibler Divergence | 248 |
+| Interdisciplinary Reference 8.4 | Information Geometry: Fisher Information Matrix | 249 |
+| Interdisciplinary Reference 8.5 | Information Geometry: Second-order Approximation of KL Divergence | 250 |
+| Interdisciplinary Reference 8.6 | Numerical Linear Algebra: Conjugate Gradient | 254 |
+| Interdisciplinary Reference 9.1 | Stochastic Process: Ornstein Uhlenbeck Process | 298 |
+| Interdisciplinary Reference 10.1 | Information Theory: Entropy | 314 |
+| Interdisciplinary Reference 12.1 | Probability Theory: Quantile Function | 368 |
+| Interdisciplinary Reference 12.2 | Metric Geometry: Wasserstein Metric | 369 |
+| Interdisciplinary Reference 12.3 | Utility Theory: von Neumann Morgenstern Utility | 372 |
+| Interdisciplinary Reference 12.4 | Utility Theory: Yaari Utility | 373 |
+| Interdisciplinary Reference 12.5 | Probability Theory: Categorical Distribution | 375 |
+| Interdisciplinary Reference 12.6 | Machine Learning: Quantile Regression | 380 |
+| Interdisciplinary Reference 13.1 | Machine Learning: Online Learning and Regret | 412 |
+| Interdisciplinary Reference 13.2 | Probability Theory: Hoeffding's Inequality | 416 |
+| Interdisciplinary Reference 13.3 | Probability Theory: Conjugate Distribution | 420 |
+| Interdisciplinary Reference 13.4 | Asymptotic Complexity: $\tilde {O}$ Notation | 424 |
+| Interdisciplinary Reference 14.1 | Board Game: Tic-Tac-Toe and Gomoku | 444 |
+| Interdisciplinary Reference 14.2 | Board Game: Reversi | 445 |
+| Interdisciplinary Reference 14.3 | Board Game: Go | 446 |
+| Interdisciplinary Reference 14.4 | Combinatorial Game Theory: Game Tree | 448 |
+| Interdisciplinary Reference 14.5 | Combinatorial Game Theory: Maximin and Minimax | 448 |
+| Interdisciplinary Reference 14.6 | Deep Learning: Residual Network | 452 |
+| Interdisciplinary Reference 15.1 | Stochastic Process: Properties of Markov Process | 481 |
+| Interdisciplinary Reference 15.2 | Stochastic Process: Continuous-Time Markov Process | 486 |
+| Interdisciplinary Reference 15.3 | Stochastic Process: Semi-Markov Process | 494 |
+| Interdisciplinary Reference 15.4 | Stochastic Process: Hidden Markov Model | 499 |
+| Interdisciplinary Reference 16.1 | Information Theory: $f$-Divergence | 532 |
+| Interdisciplinary Reference 16.2 | Machine Learning: Generative Adversarial Network | 541 |
+| Interdisciplinary Reference 16.3 | Machine Learning: Rank by Utility | 545 |
+| Interdisciplinary Reference 16.4 | Rotational Kinematics: Principal Axes and Euler's Angles | 547 |

diff --git a/en2024/setup/setupmac.md b/en2024/setup/setupmac.md
index 0b809dc..0209003 100644
--- a/en2024/setup/setupmac.md
+++ b/en2024/setup/setupmac.md
@@ -10,7 +10,7 @@ This part will show how to set up a minimum environment. After this step, you ar
 
 **Steps:**
 
-- Download the installer on https://www.anaconda.com/products/distribution (Pick MacOS Graphical Installer for MacOS users). The name of installer is alike `Anaconda3-2024.02-1-MacOSX-x86_64.pkg` (or `Anaconda3-2024.02-1-MacOSX-amd64.pkg` for M chip), and the size is about 0.6 GB.
+- Download the installer from https://www.anaconda.com/products/distribution (pick the macOS Graphical Installer). The name of the installer looks like `Anaconda3-2024.06-1-MacOSX-x86_64.pkg` (or `Anaconda3-2024.06-1-MacOSX-arm64.pkg` for Apple silicon M chips), and the size is about 0.7 GB.
 - Double click the installer to start the install wizard and install accordingly. The free space of the disk should be at least 13GB. (If the free space of the disk is too little, you may still be able to install Anaconda 3 itself, but you may not have enough free space in the follow-up steps. 13GB is the storage requirements for all steps in this article.) Record the location of Anaconda installation. The default location is `/opt/anaconda3`. We will use the location in the sequal.
 
 #### Create a New Conda Environment

diff --git a/en2024/setup/setupwin.md b/en2024/setup/setupwin.md
index 90b3b6f..e58f0ee 100644
--- a/en2024/setup/setupwin.md
+++ b/en2024/setup/setupwin.md
@@ -10,7 +10,7 @@ This part will show how to set up a minimum environment. After this step, you ar
 
 **Steps:**
 
-- Download the installer on https://www.anaconda.com/products/distribution (Pick Windows version for Windows users).The name of installer is alike `Anaconda3-2024.02-1-Windows-x86_64.exe`, and the size is about 0.9 GB.
+- Download the installer from https://www.anaconda.com/products/distribution (pick the Windows version). The name of the installer looks like `Anaconda3-2024.06-1-Windows-x86_64.exe`, and the size is about 0.9 GB.
 - Double click the installer to start the install wizard and install accordingly. The free space of the disk should be at least 13GB. (If the free space of the disk is too little, you may still be able to install Anaconda 3 itself, but you may not have enough free space in the follow-up steps. 13GB is the storage requirements for all steps in this article except Visual Studio.) Record the location of Anaconda installation. The default location is `C:%HOMEPATH%\anaconda3`. We will use the location in the sequal.
 
 #### Create a New Conda Environment

diff --git a/en2024/table.md b/en2024/table.md
new file mode 100644
index 0000000..da4ee55
--- /dev/null
+++ b/en2024/table.md
@@ -0,0 +1,53 @@
+# List of Tables
+
+| \# | Caption | Page |
+| :--- | :--- | ---: |
+| Table 1.1 | Major Python modules that the codes in this book depend on. | 13 |
+| Table 2.1 | Example initial state distribution in the task "Feed and Full". | 29 |
+| Table 2.2 | Example dynamics in the task "Feed and Full". | 29 |
+| Table 2.3 | Transition probability from state–action pair to the next state derived from Table 2.2. | 31 |
+| Table 2.4 | Expected state–action reward derived from Table 2.2. | 32 |
+| Table 2.5 | Expected reward from a state–action pair to the next state derived from Table 2.2. | 32 |
+| Table 2.6 | An example policy in the task "Feed and Full". | 33 |
+| Table 2.7 | Another example policy in the task "Feed and Full". | 33 |
+| Table 2.8 | Alternative presentation of the deterministic policy in Table 2.7. | 33 |
+| Table 2.9 | Initial state–action distribution derived from Tables 2.1 and 2.6. | 34 |
+| Table 2.10 | Transition probability from a state to the next state derived from Tables 2.2 and 2.6. | 35 |
+| Table 2.11 | Transition probability from a state–action pair to the next state–action pair derived from Tables 2.2 and 2.6. | 35 |
+| Table 2.12 | Expected state reward derived from Tables 2.2 and 2.6. | 35 |
+| Table 2.13 | State values derived from Tables 2.2 and 2.6. | 46 |
+| Table 2.14 | Action values derived from Tables 2.2 and 2.6. | 46 |
+| Table 2.15 | State values derived from Tables 2.2 and 2.7. | 47 |
+| Table 2.16 | Discounted state visitation frequency derived from Tables 2.2 and 2.6. | 56 |
+| Table 2.17 | Discounted state–action visitation frequency derived from Tables 2.2 and 2.6. | 57 |
+| Table 2.18 | Optimal state values derived from Table 2.2. | 68 |
+| Table 2.19 | Optimal action values derived from Table 2.2. | 68 |
+| Table 4.1 | Value of the cards in Blackjack. | 124 |
+| Table 5.1 | Taxi stands in the task `Taxi-v3`. | 161 |
+| Table 5.2 | Actions in the task `Taxi-v3`. | 161 |
+| Table 6.1 | Convergence of policy evaluation algorithms. | 182 |
+| Table 6.2 | Convergence of policy optimization algorithms. | 183 |
+| Table 6.3 | Tricks used by different algorithms. | 194 |
+| Table 7.1 | Observations in the cart-pole problem. | 225 |
+| Table 9.1 | Compare the task `Pendulum-v1`. | 301 |
+| Table 10.1 | Actions in the task `LunarLander-v2`. | 334 |
+| Table 10.2 | Actions in the task `LunarLanderContinuous-v2`. | 334 |
+| Table 11.1 | Observations in the task `BipedalWalker-v3`. | 359 |
+| Table 11.2 | Actions in the task `BipedalWalker-v3`. | 360 |
+| Table 12.1 | Compare Categorical DQN, QR-DQN, and IQN. | 388 |
+| Table 12.2 | Different versions of the Pong game. | 391 |
+| Table 14.1 | Compare selection policy and decision policy. | 440 |
+| Table 14.2 | Neural networks in MCTS. | 441 |
+| Table 14.3 | Some two-player zero-sum deterministic sequential board games. | 445 |
+| Table 14.4 | 8 equivalent boards. | 451 |
+| Table 14.5 | Some DRL algorithms that apply MCTS to board games. | 454 |
+| Table 14.6 | APIs for environment dynamics. | 458 |
+| Table 15.1 | Initial probability in the task "Tiger". We can delete the reward row since the initial reward is trivial. | 501 |
+| Table 15.2 | Initial emission probability in the task "Tiger". We can delete this table since the initial observation is trivial. | 501 |
+| Table 15.3 | Dynamics in the task "Tiger". We can delete the reward column if rewards are not considered. | 501 |
+| Table 15.4 | Observation probability in the task "Tiger". | 502 |
+| Table 15.5 | Conditional probability $\omega {\left ({\mathsfit {s^\prime },\mathsfit {o}}\middle \vert {{b,\mathsfit {a}}}\right )}$ in the task "Tiger". | 504 |
+| Table 15.6 | Conditional probability $\omega {\left ({\mathsfit {o}}\middle \vert {{b,\mathsfit {a}}}\right )}$ in the task "Tiger". | 505 |
+| Table 15.7 | Belief updates in the task "Tiger". | 505 |
+| Table 15.8 | $r{{\left ({{b,\mathsfit {a}}}\right )}}$ in the task "Tiger". | 506 |
+| Table 15.9 | Discounted optimal values and optimal policy of the task "Tiger". | 511 |

diff --git a/zh2023/setup/setupmac.md b/zh2023/setup/setupmac.md
index 017924e..f84e232 100644
--- a/zh2023/setup/setupmac.md
+++ b/zh2023/setup/setupmac.md
@@ -10,7 +10,7 @@
 
 **步骤:**
 
-- 从https://www.anaconda.com/products/distribution 下载Anaconda 3安装包(选择MacOS Graphical版的安装包)。安装包名字像 `Anaconda3-2024.02-1-MacOSX-x86_64.pkg`(M芯片版安装包名字像`Anaconda3-2024.02-1-MacOSX-amd64.pkg`),大小约0.6 GB。
+- 从https://www.anaconda.com/products/distribution 下载Anaconda 3安装包(选择MacOS Graphical版的安装包)。安装包名字像 `Anaconda3-2024.06-1-MacOSX-x86_64.pkg`(M芯片版安装包名字像`Anaconda3-2024.06-1-MacOSX-arm64.pkg`),大小约0.7 GB。
 - 双击安装包启动安装向导完成安装。需要安装在剩余空间大于13GB的硬盘上。(如果空间小于这个数,虽然也能完成Anaconda 3的安装,但是后续步骤的空间就不够了。13GB是后续所有步骤需要的空间。)安装过程中记下Anaconda的安装路径。默认路径为:`/opt/anaconda3`。后续操作会用到这个路径。
 
 #### 新建conda环境

diff --git a/zh2023/setup/setupwin.md b/zh2023/setup/setupwin.md
index 2bde264..6637c3d 100644
--- a/zh2023/setup/setupwin.md
+++ b/zh2023/setup/setupwin.md
@@ -10,7 +10,7 @@
 
 **步骤:**
 
-- 从https://www.anaconda.com/products/distribution 下载Anaconda 3安装包(选择Windows版的安装包)。安装包名字像 `Anaconda3-2024.02-1-Windows-x86_64.exe`,大小约0.9GB。
+- 从https://www.anaconda.com/products/distribution 下载Anaconda 3安装包(选择Windows版的安装包)。安装包名字像 `Anaconda3-2024.06-1-Windows-x86_64.exe`,大小约0.9GB。
 - 双击安装包启动安装向导完成安装。需要安装在剩余空间大于13GB的硬盘上。(如果空间小于这个数,虽然也能完成Anaconda 3的安装,但是后续步骤的空间就不够了。13GB是后续所有步骤(除了安装Visual Studio以外)需要的空间。)安装过程中记下Anaconda的安装路径。默认路径为:`C:%HOMEPATH%\anaconda3`。后续操作会用到这个路径。
 
 #### 新建conda环境
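After finishing either setup guide, a quick sanity check can confirm that the new conda environment works. The sketch below is only a suggestion (the module list is an assumption based on the frameworks named in the READMEs above; adjust it to the packages you actually installed):

```python
# Minimal sanity check for a freshly created conda environment.
# Assumes the environment from the setup guides is activated; the module
# list below is an assumption -- edit it to match what you installed.
import importlib

for name in ("numpy", "gym", "tensorflow", "torch"):
    try:
        module = importlib.import_module(name)
        print(f"{name:12s} {getattr(module, '__version__', 'unknown version')}")
    except ImportError as exc:
        print(f"{name:12s} MISSING ({exc})")
```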