I have some questions about the training process. #67
-
The content of "Step-by-step Curriculum Fine-tuning" was written in the early days of this repository. After further experimentation, however, it has become clear that this section no longer matches reality, and it is planned to be either deleted or revised. As for the transition from BC to offline RL that you are asking about, I have not fully documented it yet, but it generally works as described below.
All models implemented in this repository are designed so that they can be separated into an encoder part and a decoder part. The encoder consists of stacked transformer encoder layers, while the decoder is a very thin single- or multi-layer perceptron. The expectation is that most of the basic knowledge about Mahjong acquired during training ends up encoded in the encoder part. The decoder part differs between models and has to be relearned for each one, but the encoder part is common to all models as long as the encoder's parameters are identical. This allows the encoder trained in one model to be used as the initial values when training another model, so the learning outcomes of one model can be transferred to another. In the transition from BC to offline RL, using the encoder part of the BC-trained model as the initial values for the encoder part of the offline RL model (CQL or IQL) significantly accelerates learning at the very beginning of offline RL training. A minimal sketch of this transfer is given below.
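The following PyTorch sketch only illustrates the idea; the class names, layer sizes, and number of actions are placeholders I made up for this example, not the actual modules or dimensions used in this repository. The point is just that the encoder state dict is copied across, while the decoder is built and trained from scratch.

```python
import torch
import torch.nn as nn


# Hypothetical encoder/decoder split; the real modules in this repository differ.
class Encoder(nn.Module):
    def __init__(self, d_model: int = 512, nhead: int = 8, num_layers: int = 6) -> None:
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x)


class PolicyDecoder(nn.Module):
    """Thin head of the BC model (placeholder)."""
    def __init__(self, d_model: int = 512, num_actions: int = 100) -> None:
        super().__init__()
        self.head = nn.Linear(d_model, num_actions)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.head(h[:, 0])  # decode from the first token's encoding


class QDecoder(nn.Module):
    """Thin head of the offline-RL (CQL/IQL) model (placeholder)."""
    def __init__(self, d_model: int = 512, num_actions: int = 100) -> None:
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, num_actions)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.head(h[:, 0])


# 1. The BC model: encoder + thin policy head, trained by behavioral cloning.
bc_encoder, bc_decoder = Encoder(), PolicyDecoder()
# ... BC training happens here ...

# 2. The offline-RL model: an identically configured encoder + a fresh Q head.
rl_encoder, rl_decoder = Encoder(), QDecoder()

# 3. Transfer: initialize the RL encoder from the BC encoder's weights.
#    The decoder is not transferred and is trained from scratch.
rl_encoder.load_state_dict(bc_encoder.state_dict())
```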
When transferring a pretrained encoder in this way, the only parameters that need to be consistent are those of the encoder; the remaining parameters do not have to match the values from before the transfer. Conversely, this means that, just as in normal training, the optimal values for the parameters other than the encoder's have to be searched for again.
This really depends on the case. At least in the initial stages of training, it is certain that using a pretrained encoder as the initial value speeds up the learning process. However, which method results in higher final accuracy can only be determined through experimentation.
Honestly, there are numerous possible causes, and I simply do not have enough experimental resources to identify them, so I can hardly give a definitive answer. A general way to diagnose problems is to watch quantities such as the loss and the gradient norm on TensorBoard. It is also important to go back to the original paper of each model and understand its characteristics and behavior, which can be crucial for pinpointing issues.
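For what it's worth, a minimal sketch of that kind of monitoring with PyTorch's bundled TensorBoard writer looks like the following; the log directory and the `log_step` helper are made up for illustration.

```python
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/offline-rl")  # hypothetical log directory


def log_step(model: torch.nn.Module, loss: torch.Tensor, step: int) -> None:
    """Log the training loss and the global gradient norm; call after loss.backward()."""
    grads = [p.grad.detach().norm() for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack(grads)) if grads else torch.tensor(0.0)
    writer.add_scalar("train/loss", loss.item(), step)
    writer.add_scalar("train/grad_norm", grad_norm.item(), step)
```

Abrupt spikes in the gradient norm, or a loss that stops decreasing very early, are the kind of behaviors worth investigating first.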
-
I understand, thank you for your detailed reply. However, I have another question. In annotate4rl, the round result and the score change are filled in as features only under specific circumstances. What considerations is this decision based on?
-
The rationale behind this design decision is that the reward function r is a function of a state s, the action a taken in s, and the subsequent state s' that results from a; in other words, it has the form r(s, a, s'). Conversely, this means that anything that cannot be calculated from s, a, and s' should not be treated as a reward.

Nevertheless, I also believe that strictly limiting the reward function to the r(s, a, s') form might be too restrictive. For instance, in Suphx, the reward for a round is calculated from the difference between the global reward prediction (GRP) at the start of that round and at the start of the next round, which cannot be computed in the r(s, a, s') form. I plan to improve this in the next major version. If you have any ideas for rewards that cannot be expressed in the r(s, a, s') form, I would greatly appreciate it if you could share them; I would like to take them into account in the design of the next major version.
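To make the constraint concrete, here is a hypothetical sketch; the State fields and the particular reward are illustrative placeholders, not this repository's actual annotation format.

```python
from dataclasses import dataclass


@dataclass
class State:
    # Hypothetical per-state features; the real annotation format differs.
    round_ended: bool
    score_change: int  # the acting player's score delta, known only once the round has ended


def reward(s: State, a: int, s_next: State) -> float:
    """A reward of the form r(s, a, s'): everything it needs is contained in the transition."""
    if s_next.round_ended:
        return float(s_next.score_change)
    return 0.0


# By contrast, a Suphx-style round reward,
#     GRP(start of next round) - GRP(start of this round),
# cannot be written this way, because the GRP at the start of the round is not
# part of the single transition (s, a, s') being annotated.
```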
-
Hi Cryolite,
I'm very interested in "Step-by-step Curriculum Fine-tuning", but I'm having a bit of difficulty understanding it.
Could you further explain it to me?
Thank you for taking the time; I look forward to your response!