train error during evaluation with 1 GPU and train with multi GPU #82

Open

segalinc opened this issue Oct 13, 2023 · 2 comments
Hi, thanks for this contribution.
As a small exercise I am training SD2 on the Pokémon dataset. I precomputed the latents and training starts fine on one GPU. However, at evaluation time I get the following error:

File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/composer/trainer/trainer.py", line 2814, in _eval_loop
    self.state.outputs = self._original_model.eval_forward(self.state.batch)
  File "/fsx_vfx/users/csegalin/code/diffusion/diffusion/models/stable_diffusion.py", line 255, in eval_forward
    gen_images = self.generate(tokenized_prompts=prompts,
  File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/fsx_vfx/users/csegalin/code/diffusion/diffusion/models/stable_diffusion.py", line 464, in generate
    pred = self.unet(latent_model_input,
  File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/diffusers/models/unet_2d_condition.py", line 934, in forward
    sample = self.conv_in(sample)
  File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Calculated padded input size per channel: (162 x 2). Kernel size: (3 x 3). Kernel size can't be greater than actual input size`
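For context, the RuntimeError itself is easy to reproduce outside the repo: it is raised whenever a 3x3 convolution receives an input whose (padded) spatial size is smaller than the kernel, which suggests that whatever tensor reaches the UNet's `conv_in` here (162 x 2 per channel) does not look like a proper latent batch. A minimal sketch, assuming only PyTorch and not the repo's code:

```python
import torch
import torch.nn as nn

# Minimal reproduction of the error class above, independent of the repo:
# a 3x3 convolution cannot run when the (padded) spatial size is smaller
# than the kernel.
conv = nn.Conv2d(4, 8, kernel_size=3)  # no padding

ok = conv(torch.randn(1, 4, 32, 32))   # a plausible 32x32 latent works
print(ok.shape)                        # torch.Size([1, 8, 30, 30])

conv(torch.randn(1, 4, 162, 2))        # RuntimeError: Calculated padded input
                                       # size per channel: (162 x 2). Kernel
                                       # size: (3 x 3). Kernel size can't be
                                       # greater than actual input size
```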

This is my configuration:

```yaml
name: trial0 # Insert wandb run name
project: pokemon_sd2_256 # Insert wandb project name
seed: 17
eval_first: false
algorithms:
  low_precision_groupnorm:
    attribute: unet
    precision: amp_fp16
  low_precision_layernorm:
    attribute: unet
    precision: amp_fp16
model:
  _target_: diffusion.models.models.stable_diffusion_2
  pretrained: false
  precomputed_latents: true
  encode_latents_in_fp16: true
  fsdp: true
  val_metrics:
    - _target_: torchmetrics.MeanSquaredError
    - _target_: torchmetrics.image.fid.FrechetInceptionDistance
      normalize: true
  val_guidance_scales: [3, 7]
  # val_guidance_scales: []
  loss_bins: []
dataset:
  train_batch_size: 1 # Global training batch size
  eval_batch_size: 1  # Global evaluation batch size
  train_dataset:
    _target_: diffusion.datasets.pokemon.pokemon.build_streaming_dataloader
      # Path to object store bucket(s)
    local: /fsx_vfx/users/csegalin/data/pokemon/latents2_train
      # Path to corresponding local dataset(s)
    mode: 0
    version: 2
    drop_last: False
    shuffle: true
    prefetch_factor: 2
    num_workers: 8
    persistent_workers: true
    pin_memory: true
  eval_dataset:
    _target_: diffusion.datasets.pokemon.pokemon.build_streaming_dataloader
    local: /fsx_vfx/users/csegalin/data/pokemon/latents2_eval # Path to local dataset cache
    prefetch_factor: 2
    num_workers: 8
    persistent_workers: True
    pin_memory: True
    mode: 0
    version: 2
optimizer:
  _target_: torch.optim.AdamW
  lr: 1.0e-5
  weight_decay: 0.01
scheduler:
  _target_: composer.optim.LinearWithWarmupScheduler
  t_warmup: 1000ba
  alpha_f: 1.0
logger:
  comet-ml:
    _target_: composer.loggers.cometml_logger.CometMLLogger
    name: ${name}
    project_name: ${project}
callbacks:
  speed_monitor:
    _target_: composer.callbacks.speed_monitor.SpeedMonitor
    window_size: 10
  lr_monitor:
    _target_: composer.callbacks.lr_monitor.LRMonitor
  memory_monitor:
    _target_: composer.callbacks.memory_monitor.MemoryMonitor
  runtime_estimator:
    _target_: composer.callbacks.runtime_estimator.RuntimeEstimator
  optimizer_monitor:
    _target_: composer.callbacks.OptimizerMonitor
  image_monitor:
    _target_: diffusion.callbacks.log_diffusion_images.LogDiffusionImages
    prompts: # add any prompts you would like to visualize
    - cute dragon creature
    size: 256 # generated image resolution
    guidance_scale: 3
trainer:
  _target_: composer.Trainer
  device: gpu
  max_duration: 550000ba
  eval_interval: 1000ba
  device_train_microbatch_size: 1
  run_name: ${name}
  seed: ${seed}
  save_folder:  trained_model # Insert path to save folder or bucket
  save_interval: 3000ba
  save_overwrite: true
  autoresume: false
  # fsdp_config:
  #   sharding_strategy: "SHARD_GRAD_OP"

```

segalinc commented Oct 13, 2023

I think this is related to the FID metric, since everything works if I remove it.
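For reference, a hedged sketch (not the repo's eval code) of how the torchmetrics FID metric configured above with `normalize: true` is used: it consumes batches of RGB images shaped `(N, 3, H, W)` with float values in [0, 1], so keeping it in `val_metrics` means images have to be generated during evaluation, which is presumably why removing it avoids the generation path in the traceback.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Standalone use of the FID metric as configured above (normalize=True).
# With normalize=True, inputs are float RGB images in [0, 1] with shape
# (N, 3, H, W); random tensors stand in for real/generated images here.
fid = FrechetInceptionDistance(normalize=True)
fid.update(torch.rand(4, 3, 256, 256), real=True)   # "real" images
fid.update(torch.rand(4, 3, 256, 256), real=False)  # "generated" images
print(fid.compute())
```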

segalinc changed the title from "train error during evaluation" to "train error during evaluation with 1 GPU and train with multi GPU" on Oct 13, 2023
segalinc commented Oct 13, 2023

When I try to train on a multi-GPU machine (keeping fsdp set to true, uncommenting the last two lines of the config, and adjusting the batch size accordingly), I get this error:

```
ValueError: The world_size(2) > 1 but dataloader does not use DistributedSampler. This will cause all ranks to train on the same data, removing any benefit from multi-GPU training. To resolve this, create a Dataloader with DistributedSampler. For example, DataLoader(..., sampler=composer.utils.dist.get_sampler(...)). Alternatively, the process group can be instantiated with composer.utils.dist.instantiate_dist(...) and DistributedSampler can directly be created with DataLoader(..., sampler=DistributedSampler(...)). For more information, see https://pytorch.org/docs/stable/data.html#torch.utils.data.distributed.DistributedSampler.
```

I don't see a DistributedSampler in the laion or coco builder functions either.
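For what it's worth, the error message's own suggestion can be followed in a custom dataloader builder. A minimal sketch, not the repo's laion/coco builders (`TensorDataset` is a placeholder for the real map-style dataset), using `composer.utils.dist.get_sampler`:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from composer.utils import dist

# Placeholder map-style dataset; a real builder would return the actual
# dataset object here. Note that streaming's StreamingDataset shards data
# across ranks internally and does not go through a DistributedSampler.
dataset = TensorDataset(torch.randn(64, 4, 32, 32))

# Composer's helper returns a DistributedSampler wired to the current
# world size/rank, so each GPU sees a different shard of the data.
sampler = dist.get_sampler(dataset, shuffle=True, drop_last=False)

dataloader = DataLoader(
    dataset,
    batch_size=1,
    sampler=sampler,  # replaces shuffle=True on the DataLoader itself
    num_workers=8,
    pin_memory=True,
)
```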
