[BE] Combine OptimizerWrapper and OptimizerContainer #738
Conversation
Mostly looks good to me, but I'd prefer to simplify the code by removing some duplicated variables.
torchtitan/optimizer.py (Outdated)

```diff
@@ -25,22 +38,49 @@ def __init__(self, model_parts, optimizer_kwargs, name):
             else:
                 raise NotImplementedError(f"Optimizer {name} not added.")
             self.optimizers.append(optimizer)
+        self.plain_optim = (
```
Can we deduplicate `self.plain_optim` and `self.optimizers`? I don't see a reason why we need to keep both variables.
Thank you for the comment. Yeah, `self.plain_optim` is only used for the get/load `state_dict` APIs at checkpointing.

Made `self.optimizers` plain this time, from `List[List[optim]]` to `List[optim]`, for the in-backward case.

Also made some changes related to `optimizers.optimizers`: removed `SchedulersInBackwardContainer` and modified the assert check in `CheckpointManager`.
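For context, a minimal sketch of the flattening for the in-backward case (illustrative, not the exact PR code; the class name and the set of supported optimizers are simplified):

```python
from typing import Any, Dict, List

import torch
import torch.nn as nn


class OptimizersInBackwardContainer:
    """Sketch: one optimizer per parameter, kept in a single flat list."""

    def __init__(
        self, model_parts: List[nn.Module], optimizer_kwargs: Dict[str, Any], name: str
    ) -> None:
        # Before: List[List[Optimizer]] -- one inner list per model part.
        # After:  a flat List[Optimizer], so checkpoint code can treat the
        # in-backward and regular containers uniformly.
        self.optimizers: List[torch.optim.Optimizer] = []
        for model in model_parts:
            for param in model.parameters():
                if name == "Adam":
                    self.optimizers.append(torch.optim.Adam([param], **optimizer_kwargs))
                elif name == "AdamW":
                    self.optimizers.append(torch.optim.AdamW([param], **optimizer_kwargs))
                else:
                    raise NotImplementedError(f"Optimizer {name} not added.")
```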
```diff
-    def __init__(self, model_parts, optimizer_kwargs, name):
+    def __init__(
+        self, model_parts: List[nn.Module], optimizer_kwargs: Dict[str, Any], name: str
+    ) -> None:
         self.optimizers = []
```
ditto.
torchtitan/checkpoint.py (Outdated)

```diff
@@ -220,44 +176,29 @@ def __init__(

         TODO: This is currently unsolved and needs a fix.
         """
-        assert len(model_parts) == len(
+        if job_config.optimizer.early_step_in_backward:
```
We made `optimizers.optimizers` plain for both the in-backward and regular cases, so `len(model_parts)` does not work for the in-backward case. What do you think of the assert check here and the error message?
I think we should move the sanity check to `OptimizerContainer`'s `__init__` constructor rather than doing it here. Basically nothing changes after init, so we don't need to check it every time we save a checkpoint.
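Something like this in the constructor (a sketch, hardcoding AdamW for brevity; the real container dispatches on `name`):

```python
from typing import Any, Dict, List

import torch
import torch.nn as nn


class OptimizerContainer:
    def __init__(
        self, model_parts: List[nn.Module], optimizer_kwargs: Dict[str, Any], name: str
    ) -> None:
        self.model_parts = model_parts
        self.optimizers = [
            torch.optim.AdamW(m.parameters(), **optimizer_kwargs) for m in model_parts
        ]
        # Check once at construction time; nothing changes after init, so
        # CheckpointManager no longer needs to re-validate on every save.
        assert len(self.optimizers) == len(self.model_parts), (
            "Must pass one optimizer per model part."
        )
```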
```diff
         # It should only support saving and loading a distributed checkpoint with the same number of pp ranks
-        for idx, lr_scheduler in enumerate(lr_schedulers.schedulers):
-            self.states[f"lr_scheduler_{idx}"] = lr_scheduler
+        self.states.update(lr_schedulers.get_lr_scheduler_state())
```
Discussed with @mori360 offline: I think as a next step, we should still have a single entry for `lr_scheduler`. To achieve that, we need to understand whether any further flattening is needed. This could potentially solve the "PP multi-schedule doesn't support DCP resharding" problem.
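For reference, a sketch of how `get_lr_scheduler_state` could keep the per-stage keys while leaving room for a single flat entry (hypothetical shape; the keys follow the existing `lr_scheduler_{idx}` convention):

```python
from typing import Any, Dict, List

from torch.optim.lr_scheduler import LRScheduler


class SchedulersContainer:
    def __init__(self, schedulers: List[LRScheduler]) -> None:
        self.schedulers = schedulers  # one per optimizer

    def get_lr_scheduler_state(self) -> Dict[str, Any]:
        state_dict: Dict[str, Any] = {}
        if len(self.schedulers) == 1:
            # Single stage: one flat "lr_scheduler" entry.
            state_dict["lr_scheduler"] = self.schedulers[0]
        else:
            # PP with looped schedules: one entry per stage; saving and loading
            # must use the same number of pp ranks (no resharding for now).
            for idx, lr_scheduler in enumerate(self.schedulers):
                state_dict[f"lr_scheduler_{idx}"] = lr_scheduler
        return state_dict
```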
```python
# For now, pipeline-parallel with looped schedules does not support resharding for lr_scheduler.
# It should only support saving and loading a distributed checkpoint with the same number of pp ranks.
```
I feel this could be solved if we flatten the `lr_scheduler` state_dict, similar to what we did for models and optimizers.
This is hard to achieve. `LRScheduler`'s `state_dict` is very unstructured: it basically just returns `self.__dict__.items()`, which can contain anything. And `LRScheduler` doesn't define a parameter-group structure, so different `LRScheduler` subclasses may have different implementations. I'm not sure how to flatten `LRScheduler` in a general way, unless we only focus on one `LRScheduler`.
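For reference, the base `state_dict` in `torch.optim.lr_scheduler.LRScheduler` is essentially the following (paraphrased; subclasses like `LambdaLR` additionally special-case attributes such as `lr_lambdas`):

```python
class LRScheduler:
    def state_dict(self):
        # Everything in __dict__ except the optimizer reference -- there is
        # no declared structure (e.g. param groups) to flatten against.
        return {
            key: value for key, value in self.__dict__.items() if key != "optimizer"
        }
```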
@fegin I see. Then maybe we can just focus on `LambdaLR`, which should be straightforward: since every scheduler has the same schedule, we can store only one state and recreate the schedulers for each optimizer on the fly when loading a checkpoint.
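A minimal sketch of that idea (hypothetical container shape; assumes every scheduler is a `LambdaLR` built from the identical lambda):

```python
import copy
from typing import Any, Dict, List

from torch.optim.lr_scheduler import LambdaLR


class SchedulersContainer:
    """Sketch: store one state for N identical LambdaLR schedulers."""

    def __init__(self, schedulers: List[LambdaLR]) -> None:
        self.schedulers = schedulers  # one per optimizer, same schedule

    def state_dict(self) -> Dict[str, Any]:
        # All schedulers share the same schedule, so one entry suffices.
        return self.schedulers[0].state_dict()

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        # Re-apply the single saved state to every scheduler on the fly.
        for scheduler in self.schedulers:
            scheduler.load_state_dict(copy.deepcopy(state_dict))
```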
Yes, I want to emphasize that flattening `LRScheduler` is the right direction, which TorchRec also does, iirc. But since we may not have the bandwidth to support all of them, focusing on one or two and clearly stating what TorchTitan supports is a good idea.
lgtm, thank you! Let's follow up in another PR for the lr scheduler flattening. Please address the final comment before merging.
Combine `state_dict` and `load_state_dict` from `OptimizerWrapper` into `OptimizerContainer` so that we only have one optimizer-related class.

Also, add `get_lr_scheduler_state` to `SchedulersContainer` when updating `lr_scheduler` in `self.states`.
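For reference, a sketch of what the combined container's checkpoint hooks could look like, built on `get_optimizer_state_dict` / `set_optimizer_state_dict` from `torch.distributed.checkpoint.state_dict`; the flattening option and the one-optimizer-per-model-part pairing are assumptions, not necessarily the exact PR code:

```python
import functools
from typing import Any, Dict, List

import torch
import torch.nn as nn
from torch.distributed.checkpoint.state_dict import (
    StateDictOptions,
    get_optimizer_state_dict,
    set_optimizer_state_dict,
)


class OptimizerContainer:
    """Sketch: construction and checkpoint state_dict APIs in one class."""

    def __init__(
        self, model_parts: List[nn.Module], optimizers: List[torch.optim.Optimizer]
    ) -> None:
        self.model_parts = model_parts
        self.optimizers = optimizers  # flat, one per model part here

    def state_dict(self) -> Dict[str, Any]:
        # Flatten each part's optimizer state into a single dict that DCP
        # can save, load, and reshard.
        func = functools.partial(
            get_optimizer_state_dict,
            options=StateDictOptions(flatten_optimizer_state_dict=True),
        )
        return {
            k: v
            for sd in map(func, self.model_parts, self.optimizers)
            for k, v in sd.items()
        }

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        func = functools.partial(
            set_optimizer_state_dict,
            optim_state_dict=state_dict,
            options=StateDictOptions(flatten_optimizer_state_dict=True),
        )
        list(map(func, self.model_parts, self.optimizers))
```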