
Upgrade to Transformers 4.43 #1163

Merged
merged 34 commits into main from transformers_4.43 on Aug 7, 2024

Conversation

@regisss (Collaborator) commented Jul 26, 2024

What does this PR do?

As per title.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@avbodas commented Jul 30, 2024

I tested this PR with the text-generation example using the 1.16.2 Docker image and this command:
$ python run_generation.py --model_name_or_path meta-llama/Meta-Llama-3.1-8B --max_new_tokens 4096 --bf16 --use_hpu_graphs --use_kv_cache --batch_size 10 --attn_softmax_bf16 --limit_hpu_graphs --reuse_cache --trim_logits

It fails with the following log:
07/30/2024 04:45:57 - INFO - main - Args: Namespace(device='hpu', model_name_or_path='meta-llama/Meta-Llama-3.1-8B', bf16=True, max_new_tokens=4096, max_input_tokens=0, batch_size=10, warmup=3, n_iterations=5, local_rank=0, use_kv_cache=True, use_hpu_graphs=True, dataset_name=None, column_name=None, do_sample=False, num_beams=1, trim_logits=True, seed=27, profiling_warmup_steps=0, profiling_steps=0, profiling_record_shapes=False, prompt=None, bad_words=None, force_words=None, assistant_model=None, peft_model=None, num_return_sequences=1, token=None, model_revision='main', attn_softmax_bf16=True, output_dir=None, bucket_size=-1, bucket_internal=False, dataset_max_samples=-1, limit_hpu_graphs=True, reuse_cache=True, verbose_workers=False, simulate_dyn_prompt=None, reduce_recompile=False, use_flash_attention=False, flash_attention_recompute=False, flash_attention_causal_mask=False, flash_attention_fast_softmax=False, book_source=False, torch_compile=False, ignore_eos=True, temperature=1.0, top_p=1.0, const_serialization_path=None, disk_offload=False, trust_remote_code=False, quant_config='', world_size=0, global_rank=0)
07/30/2024 04:45:57 - INFO - main - device: hpu, n_hpu: 0, bf16: True
07/30/2024 04:45:57 - INFO - main - Model initialization took 10.293s
07/30/2024 04:45:57 - INFO - main - Graph compilation...
Warming up
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:567: UserWarning: do_sample is set to False. However, temperature is set to 0.6 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:572: UserWarning: do_sample is set to False. However, top_p is set to 0.9 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_p.
warnings.warn(
Traceback (most recent call last):
File "/optimum-habana/examples/text-generation/run_generation.py", line 666, in
main()
File "/optimum-habana/examples/text-generation/run_generation.py", line 433, in main
generate(None, args.reduce_recompile)
File "/optimum-habana/examples/text-generation/run_generation.py", line 404, in generate
outputs = model.generate(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/generation/utils.py", line 1276, in generate
result = self._sample(
File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/generation/utils.py", line 1767, in _sample
outputs = self(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1523, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 716, in forward
return wrapped_hpugraph_forward(
File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 570, in wrapped_hpugraph_forward
return orig_fwd(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/llama/modeling_llama.py", line 1092, in forward
outputs = self.model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1564, in _call_impl
result = forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/llama/modeling_llama.py", line 904, in forward
if isinstance(past_key_values[0][0], torch.Tensor):
File "/usr/local/lib/python3.10/dist-packages/transformers/cache_utils.py", line 314, in getitem
raise KeyError(f"Cache only has {len(self)} layers, attempted to access layer with index {layer_idx}")
KeyError: 'Cache only has 0 layers, attempted to access layer with index 0'
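For context, Transformers 4.43 passes past_key_values as a Cache object rather than a tuple of tensors, and indexing an empty cache raises exactly this KeyError. A minimal sketch of the kind of guard the modeling code needs, assuming the transformers.cache_utils.DynamicCache API (illustration only, not the fix that landed in this PR):

import torch
from transformers.cache_utils import DynamicCache

# A freshly created cache has no layers yet, so past_key_values[0][0]
# raises KeyError instead of returning a tensor (see the traceback above).
past_key_values = DynamicCache()

# Guard before indexing: only treat the cache as filled once it has layers.
if len(past_key_values) > 0 and isinstance(past_key_values[0][0], torch.Tensor):
    first_key = past_key_values[0][0]  # key tensor of layer 0
else:
    first_key = None  # prefill step: nothing cached yet

print(len(past_key_values), past_key_values.get_seq_length())  # 0 0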

@regisss (Collaborator, Author) commented Jul 30, 2024

@avbodas It should work with the latest commit I pushed.

@imangohari1 (Contributor) left a comment

Hi @regisss
I added some comments below based on the CI tests that were run against this PR at:

commit 49d877437cdc6c6c6ecc72bd496495fdc135acc6 (HEAD -> transformers_4.43, origin/transformers_4.43)

Author: regisss <15324346+regisss@users.noreply.github.com>

Date:   Fri Jul 26 22:32:39 2024 +0000


    Fixes

They might not all be relevant or useful, but I was wondering about your thoughts.

Other observations:

  • We see this error related to PEFT, but the installed version is peft 0.12.0 (satisfying >=0.10.0). Maybe 0.12.0 is not compatible?
[rank5]:   File "/root/optimum-habana/examples/language-modeling/run_lora_clm.py", line 841, in main
[rank5]:     trainer.accelerator.state.fsdp_plugin.auto_wrap_policy = fsdp_auto_wrap_policy(lora_model)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/peft/utils/other.py", line 396, in fsdp_auto_wrap_policy
[rank5]:     transformer_cls = FullyShardedDataParallelPlugin.get_module_class_from_name(model, layer_class)
[rank5]: AttributeError: type object 'FullyShardedDataParallelPlugin' has no attribute 'get_module_class_from_name'
  • For BridgeTower, we see this issue on remote data access. Does this need a fix too? (A minimal sketch of the fix follows this list.)
[rank3]:   File "/root/optimum-habana/examples/contrastive-image-text/run_bridgetower.py", line 625, in <module>
[rank3]:     main()
[rank3]:   File "/root/optimum-habana/examples/contrastive-image-text/run_bridgetower.py", line 323, in main
[rank3]:     dataset = load_dataset(
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 2594, in load_dataset
[rank3]:     builder_instance = load_dataset_builder(
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 2266, in load_dataset_builder
[rank3]:     dataset_module = dataset_module_factory(
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 1914, in dataset_module_factory
[rank3]:     raise e1 from None
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 1887, in dataset_module_factory
[rank3]:     ).get_module()
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 1538, in get_module
[rank3]:     raise ValueError(
[rank3]: ValueError: Loading jmhessel/newyorker_caption_contest requires you to execute the dataset script in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option `trust_remote_code=True` to remove this error.
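As the error message says, the loading call needs to opt in to running the dataset's loading script. A minimal sketch of the fix (hypothetical call site mirroring run_bridgetower.py; the config name is whatever the example passes via --dataset_config_name):

from datasets import load_dataset

# Opting in to the dataset's loading script resolves the ValueError above;
# review the script in the dataset repo first.
dataset = load_dataset(
    "jmhessel/newyorker_caption_contest",
    "matching",  # hypothetical config name, for illustration only
    trust_remote_code=True,
)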

@vidyasiv (Contributor) commented:

@regisss , I see failures for python -m pytest tests/transformers/tests/models -v -s like so:
=========================== short test summary info ============================

ERROR tests/transformers/tests/models/albert/test_modeling_albert.py
ERROR tests/transformers/tests/models/bert/test_modeling_bert.py
ERROR tests/transformers/tests/models/bridgetower/test_modeling_bridgetower.py
ERROR tests/transformers/tests/models/distilbert/test_modeling_distilbert.py
ERROR tests/transformers/tests/models/gpt2/test_modeling_gpt2.py
ERROR tests/transformers/tests/models/gptj/test_modeling_gptj.py
ERROR tests/transformers/tests/models/roberta/test_modeling_roberta.py
ERROR tests/transformers/tests/models/swin/test_modeling_swin.py
ERROR tests/transformers/tests/models/t5/test_modeling_t5.py
ERROR tests/transformers/tests/models/vit/test_modeling_vit.py
!!!!!!!!!!!!!!!!!!! Interrupted: 10 errors during collection !!!!!!!!!!!!!!!!!!!
======================== 5 warnings, 10 errors in 6.80s ========================

In my debugging I found the following:

Each model has an error like ImportError: cannot import name 'VIT_PRETRAINED_MODEL_ARCHIVE_LIST' from 'transformers.models.vit.modeling_vit' (/usr/local/lib/python3.10/dist-packages/transformers/models/vit/modeling_vit.py), which seems related to the upstream change in which those deprecated variables were finally removed.

Per the Transformers test example (https://github.com/huggingface/transformers/blob/v4.43.3/tests/models/vit/test_modeling_vit.py#L250-L254), they now appear to use specific model names instead of the archive-list variable, so a fix could resemble:

+++ b/tests/transformers/tests/models/vit/test_modeling_vit.py
@@ -37,7 +37,6 @@ if is_torch_available():
    import torch
    from torch import nn
    from transformers import ViTForImageClassification, ViTForMaskedImageModeling, ViTModel
-    from transformers.models.vit.modeling_vit import VIT_PRETRAINED_MODEL_ARCHIVE_LIST
if is_vision_available():
@@ -245,9 +244,9 @@ class ViTModelTest(ModelTesterMixin, unittest.TestCase):
    @slow
    def test_model_from_pretrained(self):
-        for model_name in VIT_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
-            model = ViTModel.from_pretrained(model_name)
-            self.assertIsNotNone(model)
+        model_name = "google/vit-base-patch16-224"
+        model = ViTModel.from_pretrained(model_name)
+        self.assertIsNotNone(model)

I tested that it will get past the error and run the tests.

@regisss (Collaborator, Author) commented Jul 31, 2024

@imangohari1 It's solved for PEFT and BridgeTower. I'll have to check that the other examples with PEFT work with v0.12.

@vidyasiv I added the changes for all these tests.

@yafshar (Contributor) commented Jul 31, 2024

@regisss please include #1086 here so the FSDP tests work as expected!

@imangohari1 (Contributor) commented:

> @imangohari1 It's solved for PEFT and BridgeTower. I'll have to check that the other examples with PEFT work with v0.12.
>
> @vidyasiv I added the changes for all these tests.

Thank you! I have started doing some subset testing and will run another integrated CI job.
Meanwhile, I am attaching a patch that fixes some of the failures in the diffusers tests.
They should all be functional, as I tested them with your changes (details below).

0001-fea-diffuser-tests-Updated-the-tests-for-4.43.patch

Please review it and apply it with git am < 0001*, and we can see the results in the next CI job.

Failed diffusers test cases:

FAILED tests/test_diffusers.py::GaudiStableDiffusionXLPipelineTester::test_stable_diffusion_xl_euler_ancestral - AssertionError: 0.032823339271545404 not less than 0.01
FAILED tests/test_diffusers.py::GaudiStableDiffusionXLPipelineTester::test_stable_diffusion_xl_turbo_euler_ancestral - AssertionError: 0.032823339271545404 not less than 0.01
FAILED tests/test_diffusers.py::GaudiStableVideoDiffusionPipelineTester::test_stable_video_diffusion_single_video - AssertionError: tensor(0.0375, dtype=torch.float64) not less than 0.01
FAILED tests/test_diffusers.py::StableDiffusionXLInpaintPipelineFastTests::test_stable_diffusion_xl_inpaint_euler - AssertionError: assert 0.03283219404220583 < 0.01
FAILED tests/test_diffusers.py::StableDiffusionXLInpaintPipelineFastTests::test_stable_diffusion_xl_refiner - AssertionError: assert 0.05005731153488158 < 0.0

I updated two of them based on: https://github.com/huggingface/diffusers/blob/main/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl.py#L181-L196

The other 3 are our own tests without any upstream counterpart.
For those, we got the expected values by importing the original SVD and SDXL pipelines, i.e.

diff --git a/tests/test_diffusers.py b/tests/test_diffusers.py
index 1def508..b4c0a92 100755
--- a/tests/test_diffusers.py
+++ b/tests/test_diffusers.py
@@ -42,6 +42,9 @@ from diffusers import (
     DiffusionPipeline,
     DPMSolverMultistepScheduler,
     EulerDiscreteScheduler,
+    EulerAncestralDiscreteScheduler,
+    StableDiffusionXLPipeline,
+    StableVideoDiffusionPipeline,
     LCMScheduler,
     PNDMScheduler,
     UNet2DConditionModel,
@@ -958,7 +961,8 @@ class GaudiStableDiffusionXLPipelineTester(TestCase):
         device = "cpu"  # ensure determinism for the device-dependent torch.Generator
         components = self.get_dummy_components()
         gaudi_config = GaudiConfig(use_torch_autocast=False)
-        sd_pipe = GaudiStableDiffusionXLPipeline(use_habana=True, gaudi_config=gaudi_config, **components)
+        sd_pipe = StableDiffusionXLPipeline( **components)
+        #sd_pipe = GaudiStableDiffusionXLPipeline(use_habana=True, gaudi_config=gaudi_config, **components)
         sd_pipe.set_progress_bar_config(disable=None)
 
         inputs = self.get_dummy_inputs(device)
@@ -977,8 +981,10 @@ class GaudiStableDiffusionXLPipelineTester(TestCase):
         device = "cpu"  # ensure determinism for the device-dependent torch.Generator
         components = self.get_dummy_components()
         gaudi_config = GaudiConfig(use_torch_autocast=False)
-        sd_pipe = GaudiStableDiffusionXLPipeline(use_habana=True, gaudi_config=gaudi_config, **components)
-        sd_pipe.scheduler = GaudiEulerAncestralDiscreteScheduler.from_config(sd_pipe.scheduler.config)
+        #sd_pipe = GaudiStableDiffusionXLPipeline(use_habana=True, gaudi_config=gaudi_config, **components)
+        #sd_pipe.scheduler = GaudiEulerAncestralDiscreteScheduler.from_config(sd_pipe.scheduler.config)
+        sd_pipe = StableDiffusionXLPipeline(**components)
+        sd_pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(sd_pipe.scheduler.config)
         sd_pipe.set_progress_bar_config(disable=None)
 
         inputs = self.get_dummy_inputs(device)
@@ -2196,7 +2202,8 @@ class GaudiStableVideoDiffusionPipelineTester(TestCase):
         device = "cpu"  # ensure determinism for the device-dependent torch.Generator
         components = self.get_dummy_components()
         gaudi_config = GaudiConfig(use_torch_autocast=False)
-        sd_pipe = GaudiStableVideoDiffusionPipeline(use_habana=True, gaudi_config=gaudi_config, **components)
+        sd_pipe = StableVideoDiffusionPipeline( **components)
+        #sd_pipe = GaudiStableVideoDiffusionPipeline(use_habana=True, gaudi_config=gaudi_config, **components)
         for component in sd_pipe.components.values():
             if hasattr(component, "set_default_attn_processor"):
                 component.set_default_attn_processor()

I talked to both @dsocek and @skavulya. We might want to consider replacing these with some upstream tests, or automating them to compute the expected values at runtime. The latter would increase the runtime.
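If we went the runtime route, a minimal sketch (hypothetical helper, reusing the upstream pipelines exactly as the patch above does) could look like:

from diffusers import EulerAncestralDiscreteScheduler, StableDiffusionXLPipeline

def reference_image_slice(components, inputs):
    """Compute the expected image slice from the upstream SDXL pipeline at runtime
    instead of hard-coding it; `components` and `inputs` come from the test's
    get_dummy_components()/get_dummy_inputs() helpers."""
    pipe = StableDiffusionXLPipeline(**components)
    pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
    pipe.set_progress_bar_config(disable=None)
    image = pipe(**inputs).images[0]      # numpy image when output_type="np"
    return image[-3:, -3:, -1].flatten()  # same slice the assertions compare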

Results (with patch)

python -m pytest tests/test_diffusers.py -s -v -k test_stable_video_diffusion_single_video
.
.
================ 1 passed, 140 deselected, 6 warnings in 9.43s =================
python -m pytest tests/test_diffusers.py -s -v -k test_stable_diffusion_xl_
.
.
==== 21 passed, 1 skipped, 119 deselected, 18 warnings in 202.51s (0:03:22) ====

@imangohari1 (Contributor) left a comment

Hi @regisss
Thanks for adding trust_remote_code to the tests.
The attached patch has minor fixes for whisper and wav2vec2 as well.
I tested them and they are working.
0002-fea-added-trust-remote-for-whisper-and-wav2vec2.patch

I have also seen a repeated error on llava_next that I would like your input on.

@regisss (Collaborator, Author) commented Aug 1, 2024

@yafshar Done!

@imangohari1 I added the patches and solved the issue for Llava-next. Don't hesitate to open PRs to merge your patches into this branch if you have other changes to suggest 🙂

@regisss (Collaborator, Author) commented Aug 1, 2024

@vivekgoe Regarding the Transformers 4.43 PR, the BERT FSDP test now fails because the trainer tries to save the FSDP model at the end of training. This didn't happen with Transformers 4.40, but we should still be able to save the FSDP model.
Here is the error I get:

File "/root/workspace/optimum-habana/examples/question-answering/run_qa.py", line 679, in main                                             
    train_result = trainer.train(resume_from_checkpoint=checkpoint)                                                                          
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 553, in train                                  
    return inner_training_loop(                                                                                                              
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1052, in _inner_training_loop                  
    self._maybe_log_save_evaluate(tr_loss, _grad_norm, model, trial, epoch, ignore_keys_for_eval)                                            
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1269, in _maybe_log_save_evaluate              
    self._save_checkpoint(model, trial, metrics=metrics)                                                                                     
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1331, in _save_checkpoint                      
    self._save_optimizer_and_scheduler(output_dir)                                                                                           
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1412, in _save_optimizer_and_scheduler         
    save_fsdp_optimizer(                                                                                                                     
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/fsdp_utils.py", line 168, in save_fsdp_optimizer                            
    optim_state = FSDP.optim_state_dict(model, optimizer)                                                                                    
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1838, in optim_state_dict       
    state_dict_settings.optim_state_dict_config, "rank0_only", False                                                                         
AttributeError: 'NoneType' object has no attribute 'optim_state_dict_config'

The issue is that fsdp_modules returns an empty list. Any idea why?
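For reference, a minimal debugging sketch (assuming the stock torch.distributed.fsdp API) of the check behind that statement; if nothing in the trainer's model is FSDP-wrapped at save time, the optimizer state-dict settings are None and Accelerate's save_fsdp_optimizer hits the AttributeError above:

import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def debug_fsdp_wrapping(model: nn.Module) -> None:
    """Print whether `model` contains any FSDP-wrapped submodules."""
    fsdp_units = FSDP.fsdp_modules(model)  # empty list -> nothing is FSDP-wrapped
    print(f"FSDP submodules found: {len(fsdp_units)}")
    if not fsdp_units:
        print("Model is not FSDP-wrapped; FSDP.optim_state_dict() has no settings to read.")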

@regisss (Collaborator, Author) commented Aug 1, 2024

@imangohari1 I see that the Llama fp8 text-generation tests failed with:

RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_LAZY Error, ValidateSyncInputTensors tensor_data is empty. Tensorid:18390 QueueStatus:ThreadPool m_tasks size: 0 irValue:id_46126_model/hpu__input

Do you get the same?

@emascarenhas (Contributor) commented:

> @imangohari1 I see that the Llama fp8 text-generation tests failed with:
>
> RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_LAZY Error, ValidateSyncInputTensors tensor_data is empty. Tensorid:18390 QueueStatus:ThreadPool m_tasks size: 0 irValue:id_46126_model/hpu__input
>
> Do you get the same?

Yes, we get the same error.

@imangohari1 (Contributor) commented:

> @imangohari1 I see that the Llama fp8 text-generation tests failed with:
>
> RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_LAZY Error, ValidateSyncInputTensors tensor_data is empty. Tensorid:18390 QueueStatus:ThreadPool m_tasks size: 0 irValue:id_46126_model/hpu__input
>
> Do you get the same?

Hi @regisss ,
Thank you for the help thus far. I will open PRs moving forward.

Here are my comments/questions/updates from today's tests:

  • Regarding the text-gen ValidateSyncInputTensors tensor_data is empty error:
    • I do see the same issue with the 1.17 driver / 1.17 container, but not with the 1.16 driver / 1.17 container.
    • I tested test_text_generation_fp8[token0-meta-llama/Llama-2-7b-hf-1-163-False-128-2048-4774.7] from OH main branch on 1.17 and it crashes with RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_SYNHELPER workspace Allocation of size ::1470892032 failed!
      • I reduced the batch size in this experiment to 128 and it got past the OOM.
    • I (re)tested this branch with the batch size lowered to 128 and to 32 for test_text_generation_fp8[token0-meta-llama/Llama-2-7b-hf and it still crashes with the same tensor_data is empty error.
    • We might need to dig into this a bit.
  • speech-recognition multi-card now passes its previous functional failures. Thanks!
  • The llava_next tests are now passing their previous functional failures. Thanks!
  • I have run the 5 previously failing tests that had trust_remote_code errors locally; 4 of them now pass, but bridgetower_bridgetower-large-itm-mlm-itc_multi_card.log is failing with the issue below. I will try to root-cause this.
[rank3]:     x = self.transforms(x)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1535, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1585, in _call_impl
[rank3]:     result = forward_call(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/container.py", line 217, in forward
[rank3]:     input = module(input)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1535, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1585, in _call_impl
[rank3]:     result = forward_call(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torchvision/transforms/transforms.py", line 354, in forward
[rank3]:     return F.resize(img, self.size, self.interpolation, self.max_size, self.antialias)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torchvision/transforms/functional.py", line 456, in resize
[rank3]:     _, image_height, image_width = get_dimensions(img)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torchvision/transforms/functional.py", line 80, in get_dimensions
[rank3]:     return F_pil.get_dimensions(img)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torchvision/transforms/_functional_pil.py", line 31, in get_dimensions
[rank3]:     raise TypeError(f"Unexpected type {type(img)}")
[rank3]: TypeError: Unexpected type <class 'NoneType'>
  • The DeepSpeed summarization task with flan-t5-xxl is failing the test since the eval metrics (rouge, gen_len) come out as 0. I checked this with the 1.16 driver / 1.16 container and got the same results. Is there something we need to update for this?
deepspeed --num_nodes 1 --num_gpus 8 run_summarization.py --model_name_or_path google/flan-t5-xxl --gaudi_config_name Habana/t5 --dataset_name cnn_dailymail --do_train --output_dir /tmp/tmpy1_wto8t --overwrite_output_dir --learning_rate 0.0001 --per_device_train_batch_size 22 --per_device_eval_batch_size 22 --num_train_epochs 2 --use_habana --throughput_warmup_steps 3 --save_strategy no --use_lazy_mode --do_eval --max_steps 10 --max_eval_samples 880 --dataset_config 3.0.0 --source_prefix summarize: --predict_with_generate --ignore_pad_token_for_loss False --pad_to_max_length --generation_max_length 129 --gradient_checkpointing --adam_epsilon 1e-08 --deepspeed ds_flan_t5_z3_config_bf16.json
.
.
***** eval metrics *****
  epoch                       =     0.0061
  eval_gen_len                =        0.0
  eval_loss                   =     0.9052
  eval_rouge1                 =        0.0 << here
  eval_rouge2                 =        0.0 << here
  eval_rougeL                 =        0.0  << here
  eval_rougeLsum              =        0.0 << here
  eval_runtime                = 0:03:16.03
  eval_samples                =        880
  eval_samples_per_second     =       4.73
  eval_steps_per_second       =      0.027
  max_memory_allocated (GB)   =      94.59
  memory_allocated (GB)       =       28.2
  total_memory_available (GB) =      94.62
    assert_function(
E   AssertionError: 0.0 not greater than or equal to 0.14147099999999999 : for metric eval_rougeLsum. 
E   ===== Assessed metrics (measured vs thresholded baseline) =====
E   eval_rougeLsum: 0.0 vs 0.14147099999999999
E   train_runtime: 92.3056 vs 93.9603
E   train_samples_per_second: 27.418 vs 25.93405
  • Thank you for pushing the fix for the deprecated XX_ARCHIVE_LIST variables. Now our transformers test suite, i.e. python -m pytest tests/transformers/tests/models -v -s, is the most populated one, with 100+ failures.
    Below are the categories of the failures. I understand those tests have not been updated yet, but would you like us to help with them? If so, could you point us to where to start finding the equivalents of the changes made in Upgrade to Transformers 4.40 #1027 (the 4.40 update)? Thank you.
    • has no attribute: 81
    • got an unexpected keyword argument 'token_idx' errors: 16
    • required positional arguments: 14
    • Not Implemented errors: 3
      Examples
FAILED tests/transformers/tests/models/bert/test_modeling_bert.py::BertModelTest::test_beam_search_generate - AttributeError: 'BertLMHeadModel' object has no attribute 'beam_search'. Di...
FAILED tests/transformers/tests/models/bert/test_modeling_bert.py::BertModelTest::test_beam_search_generate_dict_output - AttributeError: 'BertLMHeadModel' object has no attribute 'beam_search'. Di...
FAILED tests/transformers/tests/models/bert/test_modeling_bert.py::BertModelTest::test_beam_search_generate_dict_outputs_use_cache - AttributeError: 'BertLMHeadModel' object has no attribute 'beam_search'. Di...
FAILED tests/transformers/tests/models/bert/test_modeling_bert.py::BertModelTest::test_constrained_beam_search_generate - AttributeError: 'BertLMHeadModel' object has no attribute 'constrained_beam...
FAILED tests/transformers/tests/models/bert/test_modeling_bert.py::BertModelTest::test_constrained_beam_search_generate_dict_output - AttributeError: 'BertLMHeadModel' object has no attribute 'constrained_beam...
FAILED tests/transformers/tests/models/bert/test_modeling_bert.py::BertModelTest::test_contrastive_generate - TypeError: GenerationMixin._contrastive_search() missing 3 required positio...
FAILED tests/transformers/tests/models/bert/test_modeling_bert.py::BertModelTest::test_contrastive_generate_dict_outputs_use_cache - TypeError: GenerationMixin._contrastive_search() missing 3 required positio...
FAILED tests/transformers/tests/models/bert/test_modeling_bert.py::BertModelTest::test_greedy_generate - AttributeError: 'BertLMHeadModel' object has no attribute 'greedy_search'

@regisss (Collaborator, Author) commented Aug 2, 2024

@imangohari1

  • For BridgeTower, it seems the MediaPipe dataloader fails at some point because it's falling back to the Torch dataloader (which should also work...). Maybe this PR can help?
  • For Flan-T5, it's hard to say. It looks to me like a data type error; I'll check later.
  • For the Transformers tests, I think you can just compare with https://github.com/huggingface/transformers/tree/v4.43.3/tests/models. From what I see, many of the errors you get come from calling sample (or another decoding strategy) instead of _sample. And greedy_search doesn't exist anymore; it is a special case of sampling (see the upstream comment "# 13. run sample (it degenerates to greedy search when `generation_config.do_sample=False`)" and the sketch after this list).
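To illustrate the last point, a minimal sketch (hypothetical snippet using a small Hub model; the OH tests build their own tiny configs) of how decoding strategies are requested in Transformers 4.43, through generate rather than the removed greedy_search/beam_search methods:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # hypothetical model id for illustration
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Hello", return_tensors="pt")

# Greedy decoding is generate(do_sample=False); internally it runs _sample,
# which degenerates to greedy search when generation_config.do_sample=False.
greedy_out = model.generate(**inputs, do_sample=False, max_new_tokens=5)

# Beam search is likewise requested via generate, not model.beam_search(...).
beam_out = model.generate(**inputs, num_beams=2, do_sample=False, max_new_tokens=5)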

@regisss regisss merged commit 34c780e into main Aug 7, 2024
7 checks passed
@regisss regisss deleted the transformers_4.43 branch August 7, 2024 12:29
@regisss (Collaborator, Author) commented Aug 7, 2024

Merging as this PR is becoming too big. I'll open a new PR for Synapse 1.17 specific changes. Let's open new PRs for other fixes, including AFAIK:

  • Fix FSDP test, I'll do it
  • Fix Flan-T5 DeepSpeed test
  • There are issues with Llama 3.1

@regisss (Collaborator, Author) commented Aug 7, 2024

> 1. Measure the model on a number of cards that is enough for the model to fit in BF16.
> 2. Quantize the model on the same number of cards so that the scales are saved.
> 3. Run the unify_measurements.py script using the measurement files created in steps 1 and 2. A unified measurement is then calculated.
>
> Are these steps required for models that are already quantized to fp8, e.g. Llama 405B FP8? https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct-FP8
>
> I don't have two Gaudi boxes available, so I don't know if I should defer this to your team.

@endomorphosis I'm not sure the fp8 checkpoints are compatible with how we do fp8 quantization on Gaudi

@imangohari1 (Contributor) commented:

> Merging as this PR is becoming too big. I'll open a new PR for Synapse 1.17 specific changes. Let's open new PRs for other fixes, including AFAIK:
>
> * Fix FSDP test, I'll do it
> * Fix Flan-T5 DeepSpeed test
> * There are issues with Llama 3.1

For "Fix Flan-T5 DeepSpeed test": #1224

@regisss (Collaborator, Author) commented Aug 7, 2024

> For "Fix Flan-T5 DeepSpeed test": #1224

So we just needed to train the model a bit more?

@imangohari1 (Contributor) commented:

> So we just needed to train the model a bit more?

Yes.
@yeonsily cross-validated a longer training run with R&D results and they were in agreement.
Adding a few more steps here was sufficient to get past the failure.

check_min_version("4.40.0")
check_optimum_habana_min_version("1.11.0")
check_min_version("4.43.0")
check_optimum_habana_min_version("1.12.0")
@Sanatan-Shrivastava commented Aug 12, 2024

Hi, I am trying to run this test with versions (optimum-habana==1.12.1, transformers==4.43.0) but encountered the following error:

ERROR: Cannot install optimum-habana==1.12.1 and transformers==4.43.0 because these package versions have conflicting dependencies.
The conflict is caused by:
    The user requested transformers==4.43.0
    optimum-habana 1.12.1 depends on transformers<4.41.0 and >=4.40.0

I also tried with optimum-habana version 1.12.0 and encountered the same error.

Can someone please point me to a PR that corrects this, or to any documentation on how to fix it?
Thanks!


You should pip install optimum-habana from git, because this branch is ahead of 1.12.1 and has different dependencies; it should be listed as 1.13.0.dev as the version number.

@regisss (Collaborator, Author) commented:

Yeah it should be check_optimum_habana_min_version("1.13.0.dev0"), I'll add a script to do that automatically.
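A minimal sketch of what such a script could look like (hypothetical helper, not the actual script mentioned above; the target version would in practice be read from optimum/habana/version.py):

import re
from pathlib import Path

NEW_VERSION = "1.13.0.dev0"  # hypothetical hard-coded value for illustration
PATTERN = re.compile(r'check_optimum_habana_min_version\("[^"]*"\)')

def bump_min_version(examples_dir: str = "examples") -> None:
    """Rewrite check_optimum_habana_min_version(...) calls in every example script."""
    for path in Path(examples_dir).rglob("*.py"):
        text = path.read_text()
        updated = PATTERN.sub(f'check_optimum_habana_min_version("{NEW_VERSION}")', text)
        if updated != text:
            path.write_text(updated)
            print(f"Updated {path}")

if __name__ == "__main__":
    bump_min_version()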

> You should pip install optimum-habana from git, because this branch is ahead of 1.12.1 and has different dependencies; it should be listed as 1.13.0.dev as the version number.

OK great, thanks for the info. Currently I am using pip install git+https://github.com/huggingface/optimum-habana.git@{{ optimum_habana_version }}. Would using pip install git+https://github.com/huggingface/optimum-habana.git to get the latest version compatible with transformers==4.43.0 also work for future Transformers versions?

@regisss (Collaborator, Author) commented:

It may not work with future Transformers versions as there might be changes that are not compatible with what we do here in Optimum Habana.
In the coming weeks, I will open a new branch and try to maintain it so that new Transformers releases are supported but with potential perf regressions (which will be solved once it comes to the main branch).

@Sanatan-Shrivastava commented Aug 13, 2024

Thanks @regisss, that would be helpful.
Until then, I will be using transformers==4.43.0 while doing pip install git+https://github.com/huggingface/optimum-habana.git. It worked for me yesterday without specifying the optimum-habana version.
Can you please confirm whether this will keep working for 4.43.0?
For future Transformers versions, I'll keep an eye out for the new branch you mentioned.

@regisss (Collaborator, Author) commented:

Yes, the main branch should work for Transformers 4.43.x until the next time we align the library with a new version of Transformers.
