Update codes
ZhiqingXiao committed Oct 6, 2024
1 parent cd3eff4 commit 273889e
Showing 11 changed files with 294 additions and 27 deletions.
README.md (7 changes: 5 additions & 2 deletions)
@@ -21,6 +21,9 @@ This is a tutorial book on reinforcement learning, with explanation of theory an

Check [here](https://github.com/ZhiqingXiao/rl-book/tree/master/en2024) for code, exercise answers, etc.

Check [SpringerLink](https://doi.org/10.1007/978-981-19-4933-3) or [Amazon](https://www.amazon.com/dp/9811949328) for the book contents.


### Table of Codes

All code has been saved as both a .ipynb file and a .html file in the same directory.
@@ -46,7 +49,7 @@ All codes have been saved as a .ipynb file and a .html file in the same director
| 16 | [HumanoidBulletEnv-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_ClosedForm_demo.html) | BehaviorClone [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_BC_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_BC_torch.html), GAIL [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_GAILPPO_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_GAILPPO_torch.html) |
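
The notebooks in the table above target the Gym 0.26 API, in which `reset()` returns `(observation, info)` and `step()` returns five values. As a quick smoke test of a local setup, a minimal random-rollout loop might look like the sketch below (a hedged illustration, not code from the book; `CartPole-v1` is chosen here only as an example environment from the minimal Gym installation).

```python
import gym  # Gym 0.26 API: reset() -> (obs, info); step() -> 5 values

env = gym.make("CartPole-v1")           # any environment from the minimal install
observation, info = env.reset(seed=0)   # seed for reproducibility

episode_reward = 0.0
while True:
    action = env.action_space.sample()  # random placeholder policy
    observation, reward, terminated, truncated, info = env.step(action)
    episode_reward += reward
    if terminated or truncated:         # Gym 0.26 splits the old `done` flag
        break

print("episode reward =", episode_reward)
env.close()
```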


# Reinforcement Learning: Theory and Python in Practice (强化学习:原理与Python实战)
# Reinforcement Learning: Theory and Python in Practice (强化学习:原理与Python实战, 2023 Chinese Edition)

**The world's first reinforcement learning tutorial book with companion TensorFlow 2 and PyTorch 1/2 side-by-side code**

@@ -89,7 +92,7 @@ All codes have been saved as a .ipynb file and a .html file in the same director

This book introduces reinforcement learning theory and its Python implementation.
- Complete theory: the book uses a single, complete mathematical framework to teach the theoretical foundations of reinforcement learning rigorously, with proofs given for the major theorems. The chapters progress step by step and cover all mainstream RL algorithms, including non-deep RL algorithms such as eligibility traces and deep RL algorithms such as Soft Actor-Critic.
- Rich examples: implement RL algorithms on your favorite operating system (Windows, macOS, or Linux) with Python 3, Gym 0.26, and TensorFlow 2 + PyTorch 1/2. All implementations follow a consistent convention and stay small and lightweight. Chapters 1-9 provide companion implementations of the algorithms whose environments rely only on the minimal Gym installation and run on machines without a GPU; Chapters 10-12 present several popular end-to-end case studies, covering the full Gym installation and custom extensions, which run on machines with an ordinary GPU.
- Rich examples: implement RL algorithms on your favorite operating system (Windows, macOS, or Linux) with Python 3, Gym 0.26, and TensorFlow 2. All implementations follow a consistent convention and stay small and lightweight. Chapters 1-9 provide companion implementations of the algorithms whose environments rely only on the minimal Gym installation and run on machines without a GPU; Chapters 10-12 present several popular end-to-end case studies, covering the full Gym installation and custom extensions, which run on machines with an ordinary GPU.

**QQ Group**

en2024/README.md (51 changes: 30 additions & 21 deletions)
@@ -1,14 +1,16 @@
# Reinforcement Learning: Theory and Python Implementation

**The First Reinforcement Learning Tutorial Book with One-to-One Mapped TensorFlow 2 and PyTorch 1/2 Implementations**
**The First Reinforcement Learning Tutorial Book in English with One-to-One Mapped TensorFlow 2 and PyTorch 1/2 Implementations**

**Covers RL algorithms for large models, such as PPO, RLHF, IRL, and PbRL**

Please email me if you are interested in publishing this book in other languages.

### Features

This book comprehensively introduces mainstream RL theory.

- This book introduces the trunk of modern RL theory in a systematic way. All major results are accompanied by proofs. The algorithms are introduced on the basis of the theory and cover all mainstream RL algorithms, including both classical RL algorithms such as eligibility traces and deep RL algorithms such as MuZero.
- This book introduces the trunk of modern RL theory in a systematic way. All major results are accompanied by proofs. The algorithms are introduced on the basis of the theory and cover all mainstream RL algorithms, including the algorithms of the large-model era such as PPO, RLHF, IRL, and PbRL.
- This book uses a consistent set of mathematical notations, which are compatible with mainstream RL tutorials.

All chapters are accompanied by Python code.
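
The one-to-one mapping between the TensorFlow 2 and PyTorch notebooks means that each algorithm is implemented twice with matching structure. The following toy pairing is a hedged illustration of that idea, not an excerpt from the book's notebooks; the network sizes are made up for the example.

```python
# Hypothetical illustration of a tf/torch pairing; not taken from the book.
import tensorflow as tf
import torch

# TensorFlow 2 version of a small action-value network (4 inputs, 2 actions).
tf_qnet = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(2),
])

# PyTorch version with the same layer-for-layer structure.
torch_qnet = torch.nn.Sequential(
    torch.nn.Linear(4, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 2),
)
```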
@@ -22,29 +24,33 @@ All chapters are accompanied with Python codes.

### Table of Contents

01. Introduction of Reinforcement Learning
02. Markov Decision Process
03. Model-based Numeric Iteration
04. MC: Monte-Carlo Learning
05. TD: Temporal Difference Learning
06. Function Approximation
07. PG: Policy Gradient
08. AC: Actor-Critic
09. DPG: Deterministic Policy Gradient
10. Maximum-Entropy RL
11. Policy-based Gradient-Free Algorithms
12. Distributional RL
13. Minimize Regret
14. Tree Search
15. More Agent-Environment Interface
16. Learning from Feedback and Imitation Learning
01. Introduction of Reinforcement Learning [view](https://doi.org/10.1007/978-981-19-4933-3_1)
02. Markov Decision Process [view](https://doi.org/10.1007/978-981-19-4933-3_2)
03. Model-based Numeric Iteration [view](https://doi.org/10.1007/978-981-19-4933-3_3)
04. MC: Monte-Carlo Learning [view](https://doi.org/10.1007/978-981-19-4933-3_4)
05. TD: Temporal Difference Learning [view](https://doi.org/10.1007/978-981-19-4933-3_5)
06. Function Approximation [view](https://doi.org/10.1007/978-981-19-4933-3_6)
07. PG: Policy Gradient [view](https://doi.org/10.1007/978-981-19-4933-3_7)
08. AC: Actor-Critic [view](https://doi.org/10.1007/978-981-19-4933-3_8)
09. DPG: Deterministic Policy Gradient [view](https://doi.org/10.1007/978-981-19-4933-3_9)
10. Maximum-Entropy RL [view](https://doi.org/10.1007/978-981-19-4933-3_10)
11. Policy-based Gradient-Free Algorithms [view](https://doi.org/10.1007/978-981-19-4933-3_11)
12. Distributional RL [view](https://doi.org/10.1007/978-981-19-4933-3_12)
13. Minimize Regret [view](https://doi.org/10.1007/978-981-19-4933-3_13)
14. Tree Search [view](https://doi.org/10.1007/978-981-19-4933-3_14)
15. More Agent-Environment Interface [view](https://doi.org/10.1007/978-981-19-4933-3_15)
16. Learning from Feedback and Imitation Learning [view](https://doi.org/10.1007/978-981-19-4933-3_16)

### Resources

- Reference answers for the multiple-choice questions: [link](https://zhiqingxiao.github.io/rl-book/en2024/choice.html)
- Guide to setting up the development environment: [Windows](https://github.com/ZhiqingXiao/rl-book/blob/master/en2024/setup/setupwin.md) [macOS](https://github.com/ZhiqingXiao/rl-book/blob/master/en2024/setup/setupmac.md)
- Table of notations: [link](https://github.com/ZhiqingXiao/rl-book/blob/master/en2024/notation.md)
- Table of abbreviations: [link](https://github.com/zhiqingxiao/rl-book/blob/master/en2024/abbreviation.md)
- Table of algorithms: [link](https://github.com/zhiqingxiao/rl-book/blob/master/en2024/algo.md)
- Table of interdisciplinary references: [link](https://github.com/zhiqingxiao/rl-book/blob/master/en2024/iref.md)
- Table of figures: [link](https://github.com/zhiqingxiao/rl-book/blob/master/en2024/figure.md)
- Table of tables: [link](https://github.com/zhiqingxiao/rl-book/blob/master/en2024/table.md)
- Gym Internal: [link](https://github.com/ZhiqingXiao/rl-book/blob/master/en2024/gym.md)
- Bibliography: [link](https://github.com/zhiqingxiao/rl-book/blob/master/en2024/bibliography.md)

@@ -83,17 +89,20 @@ Note:
title = {Reinforcement Learning: Theory and {Python} Implementation},
author = {Zhiqing Xiao},
year = 2024,
publisher = {Springer Nature},
month = 9,
publisher = {Springer},
}

# Reinforcement Learning: Theory and Python Implementation (强化学习:原理与Python实现, English Edition)

**The world's first reinforcement learning book with one-to-one paired TensorFlow 2 and PyTorch 1/2 code**

**The world's first English-language reinforcement learning book with one-to-one paired TensorFlow 2 and PyTorch 1/2 code**

**Demystifies the large-model training techniques PPO, RLHF, IRL, and PbRL**

### Features

This book gives a complete introduction to mainstream reinforcement learning theory.
- Adopts the modern RL theoretical framework, highlights the essentials, and gives proofs of the major theorems. Explains RL algorithms on the basis of the theory, fully covering mainstream RL algorithms, including classical algorithms such as eligibility traces and deep RL algorithms such as MuZero.
- Adopts the modern RL theoretical framework, highlights the essentials, and gives proofs of the major theorems. Explains RL algorithms on the basis of the theory, fully covering mainstream RL algorithms, including classical algorithms such as eligibility traces and deep RL algorithms such as MuZero. Also covers common algorithms of the large-model era, such as PPO, RLHF, IRL, and PbRL.
- Uses a complete mathematical framework throughout, with chapters progressing step by step. Uses consistent mathematical notation that is compatible with mainstream RL tutorials.

Every chapter provides Python code, with a strong practical focus.
en2024/algo.md (88 changes: 88 additions & 0 deletions)
@@ -0,0 +1,88 @@
# List of Algorithms

| \# | Caption | Page |
| :--- | :--- | ---: |
| Algorithm 2.1 | Check whether the policy is optimal. | 49 |
| Algorithm 2.2 | Policy improvement. | 50 |
| Algorithm 3.1 | Model-based numerical iterative policy evaluation to estimate state values. | 88 |
| Algorithm 3.2 | Model-based numerical iterative policy evaluation to estimate action values. | 89 |
| Algorithm 3.3 | Model-based numerical iterative policy evaluation to estimate action values (space-saving version). | 90 |
| Algorithm 3.4 | Model-based numerical iterative policy evaluation (space-saving version, alternative implementation). | 90 |
| Algorithm 3.5 | Model-based policy iteration. | 92 |
| Algorithm 3.6 | Model-based policy iteration (space-saving version). | 92 |
| Algorithm 3.7 | Model-based VI. | 93 |
| Algorithm 3.8 | Model-based VI (space-saving version). | 93 |
| Algorithm 4.1 | Evaluate action values using every-visit MC policy evaluation. | 110 |
| Algorithm 4.2 | Every-visit MC update to evaluate state values. | 110 |
| Algorithm 4.3 | First-visit MC update to estimate action values. | 111 |
| Algorithm 4.4 | First-visit MC update to estimate state values. | 112 |
| Algorithm 4.5 | MC update with exploring start (maintaining policy explicitly). | 113 |
| Algorithm 4.6 | MC update with exploring start (maintaining policy implicitly). | 114 |
| Algorithm 4.7 | MC update with soft policy (maintaining policy explicitly). | 116 |
| Algorithm 4.8 | MC update with soft policy (maintaining policy implicitly). | 117 |
| Algorithm 4.9 | Evaluate action values using off-policy MC update based on importance sampling. | 121 |
| Algorithm 4.10 | Find an optimal policy using off-policy MC update based on importance sampling. | 123 |
| Algorithm 5.1 | One-step TD policy evaluation to estimate action values. | 139 |
| Algorithm 5.2 | One-step TD policy evaluation to estimate action values with an indicator of episode end. | 140 |
| Algorithm 5.3 | One-step TD policy evaluation to estimate state values. | 141 |
| Algorithm 5.4 | $n$-step TD policy evaluation to estimate action values. | 142 |
| Algorithm 5.5 | $n$-step TD policy evaluation to estimate state values. | 143 |
| Algorithm 5.6 | SARSA (maintaining the policy explicitly). | 144 |
| Algorithm 5.7 | SARSA (maintaining the policy implicitly). | 145 |
| Algorithm 5.8 | $n$-step SARSA. | 146 |
| Algorithm 5.9 | Expected SARSA. | 147 |
| Algorithm 5.10 | $n$-step expected SARSA. | 148 |
| Algorithm 5.11 | $n$-step TD policy evaluation of SARSA with importance sampling. | 150 |
| Algorithm 5.12 | Q learning. | 152 |
| Algorithm 5.13 | Double Q Learning. | 154 |
| Algorithm 5.14 | TD($\lambda$) policy evaluation or SARSA($\lambda$). | 158 |
| Algorithm 5.15 | TD($\lambda$) policy evaluation to estimate state values. | 159 |
| Algorithm 6.1 | Policy evaluation with function approximation and SGD. | 177 |
| Algorithm 6.2 | Policy optimization with function approximation and SGD. | 177 |
| Algorithm 6.3 | Semi-gradient descent policy evaluation to estimate action values or SARSA policy optimization. | 178 |
| Algorithm 6.4 | Semi-gradient descent policy evaluation to estimate state values, or expected SARSA policy optimization, or Q learning. | 179 |
| Algorithm 6.5 | TD($\lambda$) policy evaluation for action values or SARSA. | 181 |
| Algorithm 6.6 | TD($\lambda$) policy evaluation for state values, or expected SARSA, or Q learning. | 181 |
| Algorithm 6.7 | DQN policy optimization with experience replay (loop over episodes). | 187 |
| Algorithm 6.8 | DQN policy optimization with experience replay (without looping over episodes explicitly). | 188 |
| Algorithm 6.9 | DQN with experience replay and target network. | 191 |
| Algorithm 7.1 | VPG policy optimization. | 220 |
| Algorithm 7.2 | VPG policy optimization with baseline. | 222 |
| Algorithm 7.3 | Importance sampling PG policy optimization. | 223 |
| Algorithm 8.1 | Action-value on-policy AC. | 239 |
| Algorithm 8.2 | Advantage AC. | 240 |
| Algorithm 8.3 | A3C (one-step TD version, showing the behavior of one worker). | 240 |
| Algorithm 8.4 | Advantage AC with eligibility trace. | 242 |
| Algorithm 8.5 | Clipped PPO (simplified version). | 246 |
| Algorithm 8.6 | Clipped PPO (with on-policy experience replay). | 246 |
| Algorithm 8.7 | Vanilla NPG. | 253 |
| Algorithm 8.8 | CG. | 255 |
| Algorithm 8.9 | NPG with CG. | 255 |
| Algorithm 8.10 | TRPO. | 257 |
| Algorithm 8.11 | OffPAC. | 258 |
| Algorithm 9.1 | Vanilla on-policy deterministic AC. | 292 |
| Algorithm 9.2 | OPDAC. | 294 |
| Algorithm 9.3 | DDPG. | 295 |
| Algorithm 9.4 | TD3. | 297 |
| Algorithm 10.1 | SQL. | 326 |
| Algorithm 10.2 | SAC. | 328 |
| Algorithm 10.3 | SAC with automatic entropy adjustment. | 331 |
| Algorithm 11.1 | ES. | 356 |
| Algorithm 11.2 | ARS. | 358 |
| Algorithm 12.1 | Categorical DQN to find the optimal policy (to maximize expectation). | 377 |
| Algorithm 12.2 | Categorical DQN to find the optimal policy (to maximize VNM utility). | 378 |
| Algorithm 12.3 | QR-DQN to find the optimal policy (to maximize expectation). | 382 |
| Algorithm 12.4 | IQN to find the optimal policy (to maximize expectation). | 384 |
| Algorithm 12.5 | Categorical DQN to find the optimal policy (using the Yaari distortion function). | 387 |
| Algorithm 13.1 | $\varepsilon $-greedy. | 414 |
| Algorithm 13.2 | UCB (including UCB1). | 415 |
| Algorithm 13.3 | Bayesian UCB. | 421 |
| Algorithm 13.4 | Thompson Sampling. | 422 |
| Algorithm 13.5 | UCBVI. | 423 |
| Algorithm 14.1 | MCTS. | 435 |
| Algorithm 14.2 | AlphaZero. | 441 |
| Algorithm 14.3 | MuZero. | 442 |
| Algorithm 15.1 | Semi-gradient descent policy evaluation to estimate action values or SARSA policy optimization. | 484 |
| Algorithm 15.2 | Semi-gradient descent differential expected SARSA policy optimization, or differential Q learning. | 485 |
| Algorithm 15.3 | Model-based VI for fixed-horizon episodes. | 492 |
| Algorithm 16.1 | GAIL-PPO. | 543 |
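
The entries above refer to pseudo-code listings in the book. As a flavor of the simplest of them, a minimal NumPy sketch in the spirit of Algorithm 13.1 ($\varepsilon$-greedy) is given below; it is an illustrative assumption, not the book's listing.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Return a random action with probability epsilon, else a greedy action.

    Illustrative sketch only: q_values is a 1-D array of action-value
    estimates, and ties among greedy actions are broken uniformly at random.
    """
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))               # explore
    greedy_actions = np.flatnonzero(q_values == q_values.max())
    return int(rng.choice(greedy_actions))                    # exploit

rng = np.random.default_rng(0)
print(epsilon_greedy(np.array([0.1, 0.5, 0.5]), epsilon=0.1, rng=rng))
```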
Binary file modified en2024/cover.jpg
en2024/figure.md (62 changes: 62 additions & 0 deletions)
@@ -0,0 +1,62 @@
# List of Figures

| \# | Caption | Page |
| :--- | :--- | ---: |
| Figure 1.1 | Robot in a maze. | 2 |
| Figure 1.2 | PacMan in Atari 2600. | 4 |
| Figure 1.3 | A record for a game of Go. | 4 |
| Figure 1.4 | Bipedal walker. | 5 |
| Figure 1.5 | Large language models. | 5 |
| Figure 1.6 | Agent–environment interface. | 5 |
| Figure 1.7 | Taxonomy of RL. | 8 |
| Figure 1.8 | Relationship among RL, DL, and DRL. | 11 |
| Figure 2.1 | State transition graph of the example. | 26 |
| Figure 2.2 | Compare trajectories of DTMP, DTMRP, and DTMDP. | 28 |
| Figure 2.3 | State transition graph of the example "Feed and Full". | 29 |
| Figure 2.4 | Backup diagram showing that state values and action values represent each other. | 40 |
| Figure 2.5 | State values and action values back up themselves. | 42 |
| Figure 2.6 | Backup diagram for optimal state values and optimal action values backing up each other. | 64 |
| Figure 2.7 | Backup diagram for optimal state values and optimal action values backing up themselves. | 65 |
| Figure 2.8 | Grid of the task `CliffWalking-v0`. | 72 |
| Figure 3.1 | Policy improvement. | 91 |
| Figure 3.2 | Illustration of bootstrap. | 95 |
| Figure 4.1 | An example task of Monte Carlo. | 106 |
| Figure 4.2 | An example where the optimal policy may not be found without exploring start. | 113 |
| Figure 4.3 | State value estimates obtained by policy evaluation algorithm. | 128 |
| Figure 4.4 | Optimal policy estimates. | 129 |
| Figure 4.5 | Optimal state value estimates. | 130 |
| Figure 5.1 | Backup diagram of TD return and MC return. | 138 |
| Figure 5.2 | Maximization bias in Q learning. | 153 |
| Figure 5.3 | Backup diagram of $\lambda$ return. | 156 |
| Figure 5.4 | Compare different eligibility traces. | 158 |
| Figure 5.5 | ASCII map of the task `Taxi-v3`. | 160 |
| Figure 6.1 | MDP in Baird's counterexample. | 184 |
| Figure 6.2 | Trend of parameters with iterations. | 185 |
| Figure 6.3 | The task `MountainCar-v0`. | 195 |
| Figure 6.4 | Position and velocity of the car when it is always pushed right. | 196 |
| Figure 6.5 | One-hot coding and tile coding. | 197 |
| Figure 7.1 | The cart-pole problem. | 224 |
| Figure 8.1 | Illustration of MM algorithm. | 244 |
| Figure 8.2 | Relationship among $g_{\pi(\boldsymbol{\theta})}$, $l(\boldsymbol{\theta} \mid \boldsymbol{\theta}_k)$, and $l_c(\boldsymbol{\theta} \mid \boldsymbol{\theta}_k)$. | 252 |
| Figure 8.3 | The task `Acrobot-v1`. | 259 |
| Figure 9.1 | The task `Pendulum-v1`. | 300 |
| Figure 12.1 | Some Atari games. | 390 |
| Figure 12.2 | Neural network for Categorical DQN. | 398 |
| Figure 12.3 | Neural network for IQN. | 403 |
| Figure 14.1 | Search tree. | 434 |
| Figure 14.2 | Steps of MCTS. | 436 |
| Figure 14.3 | First two steps of the reversi opening "Chimney". | 446 |
| Figure 14.4 | Game tree of Tic-Tac-Toe. | 448 |
| Figure 14.5 | Maximin decision of Tic-Tac-Toe. | 449 |
| Figure 14.6 | MCTS with self-play. | 450 |
| Figure 14.7 | Reverse the color of all pieces on the board. | 452 |
| Figure 14.9 | Residual network. | 452 |
| Figure 14.8 | Example structure prediction network for the game of Go. | 453 |
| Figure 15.1 | MDP of the task "Tiger". | 501 |
| Figure 15.2 | Trajectories maintained by the environment and the agent. | 503 |
| Figure 15.3 | Belief MDP of the task "Tiger". | 507 |
| Figure 16.1 | Learning from feedback. | 526 |
| Figure 16.2 | Agent–environment interface of IL. | 531 |
| Figure 16.3 | Compounding error of imitation policy. | 541 |
| Figure 16.4 | Training GPT. | 545 |
| Figure 16.5 | Principal axes and Euler's angles. | 548 |
