[Aggregator | Executor: torch_model_adapter] #243

EricDinging · 2023-10-31T04:34:14Z

What happened + What you expected to happen

I validated Revert "Fix argument order & renaming" #236 by comparing the results of the current version and the old version (before the PR) on the FEMNIST fed-yogi task.

[Figures: Old | Current]

However, the testing accuracy looks strange for both the current version and the old version.

[Figures: testing accuracy, Old | Current]

I suspect that in this line, where the executor receives the model update from the aggregator and is about to start testing, the yogi optimizer is executed, which is unnecessary. I think the optimizer should only be executed in the aggregator at the end of each round. However, here in the executor, the optimizer is executed every time there is a model_train or model_test event.

FedScale/fedscale/cloud/execution/executor.py

Line 187 in e62ad70

self.model_adapter.set_weights(model_weights)
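To make the suspicion concrete, here is a minimal toy sketch (not FedScale's actual code). `ToyYogi`, `OptimizerOnSet`, and `PlainCopyOnSet` are hypothetical names, and the Yogi step follows the usual server-side formulation with made-up hyperparameters. It only illustrates that if `set_weights` runs the server optimizer, the executor ends up testing a model that differs from the one the aggregator sent:

```python
import math


class ToyYogi:
    """Hypothetical scalar-list Yogi server optimizer, for illustration only."""

    def __init__(self, eta=0.01, tau=0.001, beta1=0.9, beta2=0.99):
        self.eta, self.tau, self.beta1, self.beta2 = eta, tau, beta1, beta2
        self.m = None  # first-moment estimate
        self.v = None  # second-moment estimate

    def step(self, current, target):
        """Move `current` toward `target` with a Yogi-style adaptive step."""
        if self.m is None:
            self.m = [0.0] * len(current)
            self.v = [0.0] * len(current)
        updated = []
        for i, (c, t) in enumerate(zip(current, target)):
            delta = t - c
            d2 = delta * delta
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * delta
            sign = math.copysign(1.0, self.v[i] - d2)
            self.v[i] = self.v[i] - (1 - self.beta2) * d2 * sign
            updated.append(c + self.eta * self.m[i] / (math.sqrt(self.v[i]) + self.tau))
        return updated


class OptimizerOnSet:
    """Adapter that applies the optimizer inside set_weights (suspected behaviour)."""

    def __init__(self, weights, optimizer):
        self.weights = list(weights)
        self.optimizer = optimizer

    def set_weights(self, new_weights):
        self.weights = self.optimizer.step(self.weights, new_weights)


class PlainCopyOnSet:
    """Adapter that simply loads the weights it receives."""

    def __init__(self, weights):
        self.weights = list(weights)

    def set_weights(self, new_weights):
        self.weights = list(new_weights)


# Global model as already produced by the aggregator at the end of a round.
aggregated = [1.0, -2.0]

plain = PlainCopyOnSet([0.0, 0.0])
plain.set_weights(aggregated)

drifted = OptimizerOnSet([0.0, 0.0], ToyYogi())
drifted.set_weights(aggregated)

print("plain copy:      ", plain.weights)
print("optimizer on set:", drifted.weights)
```

In this toy, the plain copy reproduces the aggregated weights exactly, while the adapter that runs Yogi on receipt barely moves from its initial weights. With plain averaging the same extra step would be a harmless copy, which would match fed-avg working while other optimizers do not.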

Why did fed-yogi work in the previous version? I suspect that in some config files, such as

FedScale/benchmark/configs/openimage/openimage.yml

Line 54 in e62ad70

- gradient_policy: yogi # {"fed-yogi", "fed-prox", "fed-avg"}, "fed-avg" by default

it's written as "yogi" instead of "fed-yogi". As a result, in optimizers.py, the "real" optimizer is still fed-avg, because there is no if branch for "yogi". So the fed-yogi bug was not exposed.
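The fallthrough described above can be sketched as follows. This is a simplified stand-in, not the real optimizers.py: `pick_optimizer` and `pick_optimizer_strict` are hypothetical helpers, and only the "fed-yogi" branch and the `else` fallback mentioned in this issue are modelled.

```python
def pick_optimizer(gradient_policy: str) -> str:
    """Mimic an if/else dispatch where unknown names silently fall back."""
    if gradient_policy == "fed-yogi":
        return "fed-yogi"
    else:
        # Any unrecognised string, including "yogi", ends up here.
        return "fed-avg"


def pick_optimizer_strict(gradient_policy: str) -> str:
    """Variant that rejects unknown names instead of falling back (an assumption)."""
    known = {"fed-yogi", "fed-avg"}
    if gradient_policy not in known:
        raise ValueError(f"unknown gradient_policy: {gradient_policy!r}")
    return gradient_policy


print(pick_optimizer("yogi"))      # falls through to the default branch
print(pick_optimizer("fed-yogi"))  # takes the yogi branch
```

A strict lookup like the second helper is one possible way to surface a misspelled policy at startup rather than training silently with fed-avg.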

FedScale/fedscale/cloud/aggregation/optimizers.py

Line 82 in e62ad70

else:

In summary, I think there is still a bug in the executor's set_weights. It works for fed-avg, but not for other optimizers. Let me know what you think.

Versions / Dependencies

#242

Reproduction script

fedscale driver start benchmark/configs/femnist/conf.yml

Issue Severity

None

EricDinging · 2023-11-01T15:50:55Z

I tried to change the model from resnet18 to mobilenet_v2.

I also changed the server side learning rate from 0.05 to 0.01.

As you can see, the test accuracy did not improve across 100 rounds. However, the training loss did decrease, roughly from 4 to 1 for mobilenet_v2 and from 4.12 to 0.62 for lr=0.01.

fanlai0990 · 2023-11-01T16:11:28Z

Okay. I think we can use fedavg to train a model and see whether the training loss decreases to 1. This helps us understand whether the training part is correct. If it decreases to 1, then we'll know there must be something wrong on the testing side.

EricDinging · 2023-11-01T16:29:51Z

The FedAvg case I ran previously (lr = 0.05, apart from optimizer, the rest is exactly the same as above):

Let me know what you think.

SISICHEN565 · 2023-12-15T06:10:38Z

Hello, I ran into a similar problem when trying to use Yogi in Oort with FEMNIST. Although the training loss decreases to 1, the test loss stays high at around 4, and the test accuracy does not increase and remains below 0.01.

In addition, everything looks normal without using Yogi.

Are there any bugs in the code for Yogi? Do you have any idea how to deal with it?

EricDinging · 2023-12-17T13:25:37Z

@SISICHEN565
Thank you for your feedback! It is a known issue and we are fixing it. Check out PR #245.
When running the FEMNIST task with fed-yogi, you could add the settings below to benchmark/configs/femnist/conf.yml. The default config might not work well for FEMNIST. Also try running for 500 rounds.

    - yogi_eta: 0.01
    - yogi_tau: 0.001
    - yogi_beta: 0.01
    - yogi_beta2: 0.99

EricDinging added the bug label Oct 31, 2023

EricDinging mentioned this issue Dec 17, 2023: Fix fed-yogi executor model download #245 (Merged)

EricDinging closed this as completed Dec 17, 2023
