I have some questions about the training process. #67
-
The content of "Step-by-step Curriculum Fine-tuning" was written in the early days of this repository. After further experimentation, however, it has become clear that this section no longer matches reality, and it is planned to be either deleted or revised. As for the transition from BC to offline RL that you are asking about, I have not fully documented it yet, but it generally works as described below.
All models implemented in this repository are designed so that they can be separated into an encoder part and a decoder part. The encoder consists of stacked transformer encoder layers, while the decoder is a very thin single- or multi-layer perceptron. The expectation is that most of the basic knowledge about Mahjong acquired during training ends up encoded in the encoder part. The decoder part differs between models and has to be relearned for each one, but the encoder part is common to all models as long as the encoder's parameters are identical. This allows the encoder trained in one model to be used as the initial values when training another model, so the learning outcomes of one model can be transferred to another. In the transition from BC to offline RL, using the encoder part of the BC-trained model as the initial values for the encoder part of the offline RL model (CQL or IQL) significantly accelerates learning at the very beginning of offline RL training. A minimal sketch of this transfer is given below.
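The following PyTorch sketch only illustrates the idea; the class names, layer sizes, and number of actions are placeholders I made up for this example, not the actual modules or dimensions used in this repository. The point is just that the encoder state dict is copied across, while the decoder is built and trained from scratch.

```python
import torch
import torch.nn as nn


# Hypothetical encoder/decoder split; the real modules in this repository differ.
class Encoder(nn.Module):
    def __init__(self, d_model: int = 512, nhead: int = 8, num_layers: int = 6) -> None:
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x)


class PolicyDecoder(nn.Module):
    """Thin head of the BC model (placeholder)."""
    def __init__(self, d_model: int = 512, num_actions: int = 100) -> None:
        super().__init__()
        self.head = nn.Linear(d_model, num_actions)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.head(h[:, 0])  # decode from the first token's encoding


class QDecoder(nn.Module):
    """Thin head of the offline-RL (CQL/IQL) model (placeholder)."""
    def __init__(self, d_model: int = 512, num_actions: int = 100) -> None:
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, num_actions)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.head(h[:, 0])


# 1. The BC model: encoder + thin policy head, trained by behavioral cloning.
bc_encoder, bc_decoder = Encoder(), PolicyDecoder()
# ... BC training happens here ...

# 2. The offline-RL model: an identically configured encoder + a fresh Q head.
rl_encoder, rl_decoder = Encoder(), QDecoder()

# 3. Transfer: initialize the RL encoder from the BC encoder's weights.
#    The decoder is not transferred and is trained from scratch.
rl_encoder.load_state_dict(bc_encoder.state_dict())
```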
When transferring a pretrained encoder in this way, the only parameters that need to be consistent are those of the encoder; the remaining parameters do not have to match the values from before the transfer. Conversely, this means that, just as in normal training, the optimal values for the parameters other than the encoder's have to be searched for again.
This really depends on the case. At least in the initial stages of training, it is certain that using a pretrained encoder as the initial value speeds up the learning process. However, which method results in higher final accuracy can only be determined through experimentation.
Honestly, there are numerous possible causes, and I simply do not have enough experimental resources to identify them, so I can hardly give a definitive answer. A general way to diagnose problems is to watch quantities such as the loss and the gradient norm on TensorBoard. It is also important to go back to the original paper of each model and understand its characteristics and behavior, which can be crucial for pinpointing issues.
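For what it's worth, a minimal sketch of that kind of monitoring with PyTorch's bundled TensorBoard writer looks like the following; the log directory and the `log_step` helper are made up for illustration.

```python
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/offline-rl")  # hypothetical log directory


def log_step(model: torch.nn.Module, loss: torch.Tensor, step: int) -> None:
    """Log the training loss and the global gradient norm; call after loss.backward()."""
    grads = [p.grad.detach().norm() for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack(grads)) if grads else torch.tensor(0.0)
    writer.add_scalar("train/loss", loss.item(), step)
    writer.add_scalar("train/grad_norm", grad_norm.item(), step)
```

Abrupt spikes in the gradient norm, or a loss that stops decreasing very early, are the kind of behaviors worth investigating first.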
-
I understand, thank you for your detailed reply. However, I have another question. In annotate4rl, the round result and the score change are filled in as features only under specific circumstances. What considerations is this decision based on?
-
The rationale behind this design decision is that the reward function r is a function of a state s, the action a taken in s, and the subsequent state s' that results from a; in other words, it has the form r(s, a, s'). Conversely, this means that anything that cannot be calculated from s, a, and s' should not be treated as a reward.

Nevertheless, I also believe that strictly limiting the reward function to the r(s, a, s') form might be too restrictive. For instance, in Suphx, the reward for a round is calculated from the difference between the global reward prediction (GRP) at the start of that round and at the start of the next round, which cannot be computed in the r(s, a, s') form. I plan to improve this in the next major version. If you have any ideas for rewards that cannot be expressed in the r(s, a, s') form, I would greatly appreciate it if you could share them; I would like to take them into account in the design of the next major version.
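To make the constraint concrete, here is a hypothetical sketch; the State fields and the particular reward are illustrative placeholders, not this repository's actual annotation format.

```python
from dataclasses import dataclass


@dataclass
class State:
    # Hypothetical per-state features; the real annotation format differs.
    round_ended: bool
    score_change: int  # the acting player's score delta, known only once the round has ended


def reward(s: State, a: int, s_next: State) -> float:
    """A reward of the form r(s, a, s'): everything it needs is contained in the transition."""
    if s_next.round_ended:
        return float(s_next.score_change)
    return 0.0


# By contrast, a Suphx-style round reward,
#     GRP(start of next round) - GRP(start of this round),
# cannot be written this way, because the GRP at the start of the round is not
# part of the single transition (s, a, s') being annotated.
```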
-
Hi Cryolite,
I'm very interested in "Step-by-step Curriculum Fine-tuning", but I'm having a bit of difficulty understanding it.
Could you further explain it to me?
Thank you for taking the time; I look forward to your response!