A bug found in save_model of LOMOTrainer #54

DingQiang2018 opened this issue Aug 30, 2023 · 10 comments

DingQiang2018 commented Aug 30, 2023

I used LOMO (with ZeRO-3) to fine-tune chatglm2-6b on 8 NVIDIA 3090 GPUs and saved checkpoints with LOMOTrainer's save_model method. After reloading a checkpoint, I found that the validation loss it produced differed from the validation loss measured at the end of training. I rewrote save_model following DeepSpeed's official model-saving code (rewritten code below), and the discrepancy disappeared. This suggests the original save_model has a bug, but I have not yet found the specific cause.

    def save_model(self, index):
        if self.training_args.local_rank in [-1, 0]:
            checkpoint_dir = sorted(Path(self.training_args.output_dir).glob("checkpoint-*"))
            if len(checkpoint_dir) >= self.training_args.save_total_limit:
                shutil.rmtree(checkpoint_dir[0], ignore_errors=True)
        torch.distributed.barrier()

        if self.training_args.resume_step:
            output_dir = os.path.join(self.training_args.output_dir, f"checkpoint-{index+self.training_args.resume_step}")
        else:
            output_dir = os.path.join(self.training_args.output_dir, f"checkpoint-{index}")
        if not os.path.exists(output_dir):
            os.makedirs(output_dir, exist_ok=True)

        state_dict = OrderedDict() if torch.distributed.get_rank() == 0 else None
        shared_params = {}

        # Prepare for checkpoint save by ensuring all parameters are partitioned
        self.model.optimizer.partition_all_parameters()

        with deepspeed.zero.GatheredParameters(list(self.model.module.parameters()), modifier_rank=0):
            if torch.distributed.get_rank() == 0:
                for name, param in self.model.module.named_parameters():
                    if param is None:
                        continue
                    # can't rely on param.data_ptr() as it will be reused as weights gets
                    # gathered and reduced, but param.ds_id is unique across all zero weights
                    # (and shared params will have the same param.ds_id)
                    if param.ds_id in shared_params:
                        # shared weights
                        #print(f"`{name}` is shared with `{shared_params[param.ds_id]}`")
                        state_dict[name] = state_dict[shared_params[param.ds_id]]
                    else:
                        state_dict[name] = param.detach().cpu()
                        shared_params[param.ds_id] = name
                    #print(f"param {param.ds_id} {param.shape} {name} ")

                # now buffers - not sure if need to take care of potentially shared weights here
                for name, buf in self.model.module.named_buffers():
                    if (buf is not None and name not in self.model.module._non_persistent_buffers_set):
                        state_dict[name] = buf.detach().cpu()

        if len(self.model.optimizer.persistent_parameters) > 0:
            self.model.optimizer.persistent_parameters[0].all_gather(self.model.optimizer.persistent_parameters)

        if torch.distributed.get_rank() == 0:
            torch.save(state_dict, os.path.join(output_dir, 'pytorch_model.bin'))

        torch.distributed.barrier()
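
For reference, here is a minimal sketch of how the saved pytorch_model.bin could be reloaded to re-check the validation loss. The checkpoint path is a placeholder and the evaluation loop is left out; this is not code from the repository.

    import os
    import torch
    from transformers import AutoModel

    # hypothetical checkpoint directory produced by save_model above
    ckpt_dir = "outputs/checkpoint-1000"

    # chatglm2-6b is loaded with trust_remote_code; its weights are then replaced
    # by the state dict that save_model wrote
    model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
    state_dict = torch.load(os.path.join(ckpt_dir, "pytorch_model.bin"), map_location="cpu")
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    print("missing keys:", missing)
    print("unexpected keys:", unexpected)

    # run your validation loop here and compare the loss with the value
    # logged at the end of training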

KaiLv69 commented Aug 31, 2023

Thanks for your kind feedback; save_model() has been updated according to your advice. FYI: 06e50c0

DingQiang2018 commented Aug 31, 2023

I'm glad my suggestion was adopted. May I also ask whether you have any thoughts on the specific cause of the error in the previous save_model? I haven't figured it out yet.

@DingQiang2018

Hi, the reason I want to know the answer is that the LOMO optimizer implementation and the save_model code assume the same layout for DeepSpeed's partitioned parameters: each parameter is flattened and split into chunks, with the i-th chunk assigned to the i-th process. I am not sure whether DeepSpeed actually partitions parameters that way, so the save_model code I provided above does not rely on this assumption; it only uses deepspeed.zero.GatheredParameters, provided by DeepSpeed, to gather the parameters automatically. To my surprise, this change fixed the bug, so I suspect the cause may be an incorrect assumption about parameter partitioning. This has shaken my confidence in the correctness of the LOMO optimizer implementation, and I hope the author can clear up my doubts.
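
To make the contrast concrete, below is a minimal sketch (not the LOMO implementation) of the two ways a full ZeRO-3 parameter can be recovered. The ds_tensor, ds_numel and ds_shape attributes, the flat-chunk layout, and the helper names are assumptions made for illustration; deepspeed.zero.GatheredParameters is the public API used in the save_model above.

    import torch
    import torch.distributed as dist
    import deepspeed

    def gather_manually(param):
        # assumed layout: the parameter is flattened, padded to a multiple of
        # world_size, and rank i holds the i-th contiguous chunk in param.ds_tensor
        world_size = dist.get_world_size()
        chunks = [torch.empty_like(param.ds_tensor) for _ in range(world_size)]
        dist.all_gather(chunks, param.ds_tensor)
        flat = torch.cat(chunks)[: param.ds_numel]  # drop the padding
        return flat.view(param.ds_shape)

    def gather_with_api(param):
        # public API: DeepSpeed materializes the full parameter inside the context,
        # however the shards are laid out internally
        with deepspeed.zero.GatheredParameters([param]):
            return param.detach().cpu().clone()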

KaiLv69 self-assigned this Sep 17, 2023
@yueg-security

@DingQiang2018 Hi, I noticed that the author modified LOMOTrainer and LOMOLoRaTrainer according to your suggestion. LOMOTrainer runs without problems, but LOMOLoRaTrainer raises an error at self.model.optimizer.partition_all_parameters(). Have you run into the same problem? Thanks!

@wasifferoze

Yeah, I am having this issue. Did you find any solution?

@yueg-security

Not solved yet...

@shawnricecake

I also cannot get the same results after merging LLaMA with LoRA, which is strange.

KaiLv69 commented Oct 18, 2023

Hi, because lomo_lora_trainer has an additional optimizer for LoRA, DeepSpeedZeRoOffload cannot be reached through model.optimizer. For now, I have reverted save_model() in lomo_lora_trainer.py to the previous version.
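
For anyone who wants to keep the new saving path in the LoRA trainer, one possible stopgap is to guard the ZeRO-specific call. This is only a sketch: the attribute names come from the snippet above, and the check itself is not the repository's fix.

    # skip the partitioning step when the engine's optimizer does not expose it,
    # e.g. when a separate LoRA optimizer is attached instead of the ZeRO-3 one
    zero_opt = getattr(self.model, "optimizer", None)
    if zero_opt is not None and hasattr(zero_opt, "partition_all_parameters"):
        zero_opt.partition_all_parameters()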

KaiLv69 commented Oct 18, 2023

Hi, I'd like to know how much the ChatGLM2 validation loss differs between the two saving methods; do you still have a record of it? BTW, does LLaMA have the same problem?

@shawnricecake

Hi, I noticed that, but I still cannot get the merged model to produce the same eval results...
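
For context, the usual peft merge flow looks like the sketch below; the paths are placeholders and this is not the repository's code, so differing eval results would have to be debugged separately.

    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained("path/to/llama")          # placeholder
    model = PeftModel.from_pretrained(base, "path/to/lora-checkpoint")    # placeholder
    merged = model.merge_and_unload()  # folds the LoRA weights into the base weights
    merged.save_pretrained("path/to/merged")
    # evaluate the merged model and compare against base + adapter (un-merged)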
