
Upgrade to Transformers 4.43 #1163

Merged
merged 34 commits into main from transformers_4.43 on Aug 7, 2024

Conversation

@regisss (Collaborator) commented Jul 26, 2024

What does this PR do?

As per title.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@avbodas commented Jul 30, 2024

I tested this PR with the text-generation example using the 1.16.2 Docker image and this command:
$ python run_generation.py --model_name_or_path meta-llama/Meta-Llama-3.1-8B --max_new_tokens 4096 --bf16 --use_hpu_graphs --use_kv_cache --batch_size 10 --attn_softmax_bf16 --limit_hpu_graphs --reuse_cache --trim_logits

It fails with the following log:
07/30/2024 04:45:57 - INFO - main - Args: Namespace(device='hpu', model_name_or_path='meta-llama/Meta-Llama-3.1-8B', bf16=True, max_new_tokens=4096, max_input_tokens=0, batch_size=10, warmup=3, n_iterations=5, local_rank=0, use_kv_cache=True, use_hpu_graphs=True, dataset_name=None, column_name=None, do_sample=False, num_beams=1, trim_logits=True, seed=27, profiling_warmup_steps=0, profiling_steps=0, profiling_record_shapes=False, prompt=None, bad_words=None, force_words=None, assistant_model=None, peft_model=None, num_return_sequences=1, token=None, model_revision='main', attn_softmax_bf16=True, output_dir=None, bucket_size=-1, bucket_internal=False, dataset_max_samples=-1, limit_hpu_graphs=True, reuse_cache=True, verbose_workers=False, simulate_dyn_prompt=None, reduce_recompile=False, use_flash_attention=False, flash_attention_recompute=False, flash_attention_causal_mask=False, flash_attention_fast_softmax=False, book_source=False, torch_compile=False, ignore_eos=True, temperature=1.0, top_p=1.0, const_serialization_path=None, disk_offload=False, trust_remote_code=False, quant_config='', world_size=0, global_rank=0)
07/30/2024 04:45:57 - INFO - main - device: hpu, n_hpu: 0, bf16: True
07/30/2024 04:45:57 - INFO - main - Model initialization took 10.293s
07/30/2024 04:45:57 - INFO - main - Graph compilation...
Warming up
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:567: UserWarning: do_sample is set to False. However, temperature is set to 0.6 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:572: UserWarning: do_sample is set to False. However, top_p is set to 0.9 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_p.
warnings.warn(
Traceback (most recent call last):
File "/optimum-habana/examples/text-generation/run_generation.py", line 666, in
main()
File "/optimum-habana/examples/text-generation/run_generation.py", line 433, in main
generate(None, args.reduce_recompile)
File "/optimum-habana/examples/text-generation/run_generation.py", line 404, in generate
outputs = model.generate(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/generation/utils.py", line 1276, in generate
result = self._sample(
File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/generation/utils.py", line 1767, in _sample
outputs = self(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1523, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 716, in forward
return wrapped_hpugraph_forward(
File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 570, in wrapped_hpugraph_forward
return orig_fwd(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/llama/modeling_llama.py", line 1092, in forward
outputs = self.model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1564, in _call_impl
result = forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/llama/modeling_llama.py", line 904, in forward
if isinstance(past_key_values[0][0], torch.Tensor):
File "/usr/local/lib/python3.10/dist-packages/transformers/cache_utils.py", line 314, in getitem
raise KeyError(f"Cache only has {len(self)} layers, attempted to access layer with index {layer_idx}")
KeyError: 'Cache only has 0 layers, attempted to access layer with index 0'
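For context, Transformers 4.43 passes past_key_values as a Cache object rather than a tuple of tensors, and indexing an empty cache raises exactly this KeyError. A minimal sketch of the kind of guard the modeling code needs, assuming the transformers.cache_utils.DynamicCache API (illustration only, not the fix that landed in this PR):

import torch
from transformers.cache_utils import DynamicCache

# A freshly created cache has no layers yet, so past_key_values[0][0]
# raises KeyError instead of returning a tensor (see the traceback above).
past_key_values = DynamicCache()

# Guard before indexing: only treat the cache as filled once it has layers.
if len(past_key_values) > 0 and isinstance(past_key_values[0][0], torch.Tensor):
    first_key = past_key_values[0][0]  # key tensor of layer 0
else:
    first_key = None  # prefill step: nothing cached yet

print(len(past_key_values), past_key_values.get_seq_length())  # 0 0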

@regisss (Collaborator, Author) commented Jul 30, 2024

@avbodas It should work with the latest commit I pushed.

@imangohari1 (Contributor) left a comment

Hi @regisss
I added some comments below based on the CI tests that were run against this PR at:

commit 49d877437cdc6c6c6ecc72bd496495fdc135acc6 (HEAD -> transformers_4.43, origin/transformers_4.43)

Author: regisss <15324346+regisss@users.noreply.github.com>

Date:   Fri Jul 26 22:32:39 2024 +0000


    Fixes

They might not all be relevant or useful, but I was wondering about your thoughts.

Other observations:

  • We see this error related to PEFT, but the installed version is peft 0.12.0 (satisfying >=0.10.0). Maybe 0.12.0 is not compatible?
[rank5]:   File "/root/optimum-habana/examples/language-modeling/run_lora_clm.py", line 841, in main
[rank5]:     trainer.accelerator.state.fsdp_plugin.auto_wrap_policy = fsdp_auto_wrap_policy(lora_model)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/peft/utils/other.py", line 396, in fsdp_auto_wrap_policy
[rank5]:     transformer_cls = FullyShardedDataParallelPlugin.get_module_class_from_name(model, layer_class)
[rank5]: AttributeError: type object 'FullyShardedDataParallelPlugin' has no attribute 'get_module_class_from_name'
  • For BridgeTower, we see this issue on remote data access. Does this need a fix too? (A minimal sketch of the fix follows this list.)
[rank3]:   File "/root/optimum-habana/examples/contrastive-image-text/run_bridgetower.py", line 625, in <module>
[rank3]:     main()
[rank3]:   File "/root/optimum-habana/examples/contrastive-image-text/run_bridgetower.py", line 323, in main
[rank3]:     dataset = load_dataset(
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 2594, in load_dataset
[rank3]:     builder_instance = load_dataset_builder(
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 2266, in load_dataset_builder
[rank3]:     dataset_module = dataset_module_factory(
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 1914, in dataset_module_factory
[rank3]:     raise e1 from None
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 1887, in dataset_module_factory
[rank3]:     ).get_module()
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 1538, in get_module
[rank3]:     raise ValueError(
[rank3]: ValueError: Loading jmhessel/newyorker_caption_contest requires you to execute the dataset script in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option `trust_remote_code=True` to remove this error.
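As the error message says, the loading call needs to opt in to running the dataset's loading script. A minimal sketch of the fix (hypothetical call site mirroring run_bridgetower.py; the config name is whatever the example passes via --dataset_config_name):

from datasets import load_dataset

# Opting in to the dataset's loading script resolves the ValueError above;
# review the script in the dataset repo first.
dataset = load_dataset(
    "jmhessel/newyorker_caption_contest",
    "matching",  # hypothetical config name, for illustration only
    trust_remote_code=True,
)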

@vidyasiv (Contributor) commented:

@regisss , I see failures for python -m pytest tests/transformers/tests/models -v -s like so:
=========================== short test summary info ============================

ERROR tests/transformers/tests/models/albert/test_modeling_albert.py
ERROR tests/transformers/tests/models/bert/test_modeling_bert.py
ERROR tests/transformers/tests/models/bridgetower/test_modeling_bridgetower.py
ERROR tests/transformers/tests/models/distilbert/test_modeling_distilbert.py
ERROR tests/transformers/tests/models/gpt2/test_modeling_gpt2.py
ERROR tests/transformers/tests/models/gptj/test_modeling_gptj.py
ERROR tests/transformers/tests/models/roberta/test_modeling_roberta.py
ERROR tests/transformers/tests/models/swin/test_modeling_swin.py
ERROR tests/transformers/tests/models/t5/test_modeling_t5.py
ERROR tests/transformers/tests/models/vit/test_modeling_vit.py
!!!!!!!!!!!!!!!!!!! Interrupted: 10 errors during collection !!!!!!!!!!!!!!!!!!!
======================== 5 warnings, 10 errors in 6.80s ========================

In my debugging I found the following:

Each model has an error like ImportError: cannot import name 'VIT_PRETRAINED_MODEL_ARCHIVE_LIST' from 'transformers.models.vit.modeling_vit' (/usr/local/lib/python3.10/dist-packages/transformers/models/vit/modeling_vit.py), which seems related to the upstream change in which those deprecated variables were finally removed.

Per the Transformers test example (https://github.com/huggingface/transformers/blob/v4.43.3/tests/models/vit/test_modeling_vit.py#L250-L254), they now appear to use specific model names instead of the archive-list variable, so a fix could resemble:

+++ b/tests/transformers/tests/models/vit/test_modeling_vit.py
@@ -37,7 +37,6 @@ if is_torch_available():
    import torch
    from torch import nn
    from transformers import ViTForImageClassification, ViTForMaskedImageModeling, ViTModel
-    from transformers.models.vit.modeling_vit import VIT_PRETRAINED_MODEL_ARCHIVE_LIST
if is_vision_available():
@@ -245,9 +244,9 @@ class ViTModelTest(ModelTesterMixin, unittest.TestCase):
    @slow
    def test_model_from_pretrained(self):
-        for model_name in VIT_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
-            model = ViTModel.from_pretrained(model_name)
-            self.assertIsNotNone(model)
+        model_name = "google/vit-base-patch16-224"
+        model = ViTModel.from_pretrained(model_name)
+        self.assertIsNotNone(model)

I tested that it will get past the error and run the tests.

@regisss (Collaborator, Author) commented Jul 31, 2024

@imangohari1 It's solved for PEFT and BridgeTower. I'll have to check that the other examples with PEFT work with v0.12.

@vidyasiv I added the changes for all these tests.

@yafshar (Contributor) commented Jul 31, 2024

@regisss please include #1086 here so the FSDP tests work as expected!

@imangohari1 (Contributor) commented:

> @imangohari1 It's solved for PEFT and BridgeTower. I'll have to check that the other examples with PEFT work with v0.12.
>
> @vidyasiv I added the changes for all these tests.

Thank you! I have started doing some subset testing and will run another integrated CI job.
Meanwhile, I am attaching a patch that fixes some of the failures in the diffusers tests.
They should all be functional, as I tested them with your changes (details below).

0001-fea-diffuser-tests-Updated-the-tests-for-4.43.patch

Please review it and apply it with git am < 0001*, and we can see the results in the next CI job.

Failed diffusers test cases:

FAILED tests/test_diffusers.py::GaudiStableDiffusionXLPipelineTester::test_stable_diffusion_xl_euler_ancestral - AssertionError: 0.032823339271545404 not less than 0.01
FAILED tests/test_diffusers.py::GaudiStableDiffusionXLPipelineTester::test_stable_diffusion_xl_turbo_euler_ancestral - AssertionError: 0.032823339271545404 not less than 0.01
FAILED tests/test_diffusers.py::GaudiStableVideoDiffusionPipelineTester::test_stable_video_diffusion_single_video - AssertionError: tensor(0.0375, dtype=torch.float64) not less than 0.01
FAILED tests/test_diffusers.py::StableDiffusionXLInpaintPipelineFastTests::test_stable_diffusion_xl_inpaint_euler - AssertionError: assert 0.03283219404220583 < 0.01
FAILED tests/test_diffusers.py::StableDiffusionXLInpaintPipelineFastTests::test_stable_diffusion_xl_refiner - AssertionError: assert 0.05005731153488158 < 0.0

I updated two of them based on: https://github.com/huggingface/diffusers/blob/main/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl.py#L181-L196

The other 3 are our own tests without any upstream counterpart.
For those, we got the expected values by importing the original SVD and SDXL pipelines, i.e.

diff --git a/tests/test_diffusers.py b/tests/test_diffusers.py
index 1def508..b4c0a92 100755
--- a/tests/test_diffusers.py
+++ b/tests/test_diffusers.py
@@ -42,6 +42,9 @@ from diffusers import (
     DiffusionPipeline,
     DPMSolverMultistepScheduler,
     EulerDiscreteScheduler,
+    EulerAncestralDiscreteScheduler,
+    StableDiffusionXLPipeline,
+    StableVideoDiffusionPipeline,
     LCMScheduler,
     PNDMScheduler,
     UNet2DConditionModel,
@@ -958,7 +961,8 @@ class GaudiStableDiffusionXLPipelineTester(TestCase):
         device = "cpu"  # ensure determinism for the device-dependent torch.Generator
         components = self.get_dummy_components()
         gaudi_config = GaudiConfig(use_torch_autocast=False)
-        sd_pipe = GaudiStableDiffusionXLPipeline(use_habana=True, gaudi_config=gaudi_config, **components)
+        sd_pipe = StableDiffusionXLPipeline( **components)
+        #sd_pipe = GaudiStableDiffusionXLPipeline(use_habana=True, gaudi_config=gaudi_config, **components)
         sd_pipe.set_progress_bar_config(disable=None)
 
         inputs = self.get_dummy_inputs(device)
@@ -977,8 +981,10 @@ class GaudiStableDiffusionXLPipelineTester(TestCase):
         device = "cpu"  # ensure determinism for the device-dependent torch.Generator
         components = self.get_dummy_components()
         gaudi_config = GaudiConfig(use_torch_autocast=False)
-        sd_pipe = GaudiStableDiffusionXLPipeline(use_habana=True, gaudi_config=gaudi_config, **components)
-        sd_pipe.scheduler = GaudiEulerAncestralDiscreteScheduler.from_config(sd_pipe.scheduler.config)
+        #sd_pipe = GaudiStableDiffusionXLPipeline(use_habana=True, gaudi_config=gaudi_config, **components)
+        #sd_pipe.scheduler = GaudiEulerAncestralDiscreteScheduler.from_config(sd_pipe.scheduler.config)
+        sd_pipe = StableDiffusionXLPipeline(**components)
+        sd_pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(sd_pipe.scheduler.config)
         sd_pipe.set_progress_bar_config(disable=None)
 
         inputs = self.get_dummy_inputs(device)
@@ -2196,7 +2202,8 @@ class GaudiStableVideoDiffusionPipelineTester(TestCase):
         device = "cpu"  # ensure determinism for the device-dependent torch.Generator
         components = self.get_dummy_components()
         gaudi_config = GaudiConfig(use_torch_autocast=False)
-        sd_pipe = GaudiStableVideoDiffusionPipeline(use_habana=True, gaudi_config=gaudi_config, **components)
+        sd_pipe = StableVideoDiffusionPipeline( **components)
+        #sd_pipe = GaudiStableVideoDiffusionPipeline(use_habana=True, gaudi_config=gaudi_config, **components)
         for component in sd_pipe.components.values():
             if hasattr(component, "set_default_attn_processor"):
                 component.set_default_attn_processor()

I talked to both @dsocek and @skavulya. We might want to consider replacing these with some upstream tests, or automating them to compute the expected values at runtime. The latter would increase the runtime.
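If we went the runtime route, a minimal sketch (hypothetical helper, reusing the upstream pipelines exactly as the patch above does) could look like:

from diffusers import EulerAncestralDiscreteScheduler, StableDiffusionXLPipeline

def reference_image_slice(components, inputs):
    """Compute the expected image slice from the upstream SDXL pipeline at runtime
    instead of hard-coding it; `components` and `inputs` come from the test's
    get_dummy_components()/get_dummy_inputs() helpers."""
    pipe = StableDiffusionXLPipeline(**components)
    pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
    pipe.set_progress_bar_config(disable=None)
    image = pipe(**inputs).images[0]      # numpy image when output_type="np"
    return image[-3:, -3:, -1].flatten()  # same slice the assertions compare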

Results (with patch)

python -m pytest tests/test_diffusers.py -s -v -k test_stable_video_diffusion_single_video
.
.
================ 1 passed, 140 deselected, 6 warnings in 9.43s =================
python -m pytest tests/test_diffusers.py -s -v -k test_stable_diffusion_xl_
.
.
==== 21 passed, 1 skipped, 119 deselected, 18 warnings in 202.51s (0:03:22) ====

@imangohari1 (Contributor) left a comment

Hi @regisss
Thanks for adding trust_remote_code to the tests.
The attached patch has minor fixes for whisper and wav2vec2 as well.
I tested them and they are working.
0002-fea-added-trust-remote-for-whisper-and-wav2vec2.patch

I have also seen a repeated error on llava_next that I would like your input on.

@regisss (Collaborator, Author) commented Aug 1, 2024

@yafshar Done!

@imangohari1 I added the patches and solved the issue for Llava-next. Don't hesitate to open PRs to merge your patches into this branch if you have other changes to suggest 🙂

@regisss (Collaborator, Author) commented Aug 1, 2024

@vivekgoe Regarding the Transformers 4.43 PR, the BERT FSDP test now fails because the trainer tries to save the FSDP model at the end of training. This didn't happen with Transformers 4.40, but we should still be able to save the FSDP model.
Here is the error I get:

File "/root/workspace/optimum-habana/examples/question-answering/run_qa.py", line 679, in main                                             
    train_result = trainer.train(resume_from_checkpoint=checkpoint)                                                                          
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 553, in train                                  
    return inner_training_loop(                                                                                                              
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1052, in _inner_training_loop                  
    self._maybe_log_save_evaluate(tr_loss, _grad_norm, model, trial, epoch, ignore_keys_for_eval)                                            
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1269, in _maybe_log_save_evaluate              
    self._save_checkpoint(model, trial, metrics=metrics)                                                                                     
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1331, in _save_checkpoint                      
    self._save_optimizer_and_scheduler(output_dir)                                                                                           
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1412, in _save_optimizer_and_scheduler         
    save_fsdp_optimizer(                                                                                                                     
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/fsdp_utils.py", line 168, in save_fsdp_optimizer                            
    optim_state = FSDP.optim_state_dict(model, optimizer)                                                                                    
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1838, in optim_state_dict       
    state_dict_settings.optim_state_dict_config, "rank0_only", False                                                                         
AttributeError: 'NoneType' object has no attribute 'optim_state_dict_config'

The issue is that fsdp_modules returns an empty list. Any idea why?
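For reference, a minimal debugging sketch (assuming the stock torch.distributed.fsdp API) of the check behind that statement; if nothing in the trainer's model is FSDP-wrapped at save time, the optimizer state-dict settings are None and Accelerate's save_fsdp_optimizer hits the AttributeError above:

import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def debug_fsdp_wrapping(model: nn.Module) -> None:
    """Print whether `model` contains any FSDP-wrapped submodules."""
    fsdp_units = FSDP.fsdp_modules(model)  # empty list -> nothing is FSDP-wrapped
    print(f"FSDP submodules found: {len(fsdp_units)}")
    if not fsdp_units:
        print("Model is not FSDP-wrapped; FSDP.optim_state_dict() has no settings to read.")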

@regisss (Collaborator, Author) commented Aug 1, 2024

@imangohari1 I see that the Llama fp8 text-generation tests failed with:

RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_LAZY Error, ValidateSyncInputTensors tensor_data is empty. Tensorid:18390 QueueStatus:ThreadPool m_tasks size: 0 irValue:id_46126_model/hpu__input

Do you get the same?

@emascarenhas (Contributor) commented:

> @imangohari1 I see that the Llama fp8 text-generation tests failed with:
>
> RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_LAZY Error, ValidateSyncInputTensors tensor_data is empty. Tensorid:18390 QueueStatus:ThreadPool m_tasks size: 0 irValue:id_46126_model/hpu__input
>
> Do you get the same?

Yes, we get the same error.

@imangohari1 (Contributor) commented:

> @imangohari1 I see that the Llama fp8 text-generation tests failed with:
>
> RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_LAZY Error, ValidateSyncInputTensors tensor_data is empty. Tensorid:18390 QueueStatus:ThreadPool m_tasks size: 0 irValue:id_46126_model/hpu__input
>
> Do you get the same?

Hi @regisss ,
Thank you for the help thus far. I will open PRs moving forward.

Here are my comments/questions/updates from today's tests:

  • Regarding the text-gen ValidateSyncInputTensors tensor_data is empty error:
    • I do see the same issue with the 1.17 driver / 1.17 container, but not with the 1.16 driver / 1.17 container.
    • I tested test_text_generation_fp8[token0-meta-llama/Llama-2-7b-hf-1-163-False-128-2048-4774.7] from OH main branch on 1.17 and it crashes with RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_SYNHELPER workspace Allocation of size ::1470892032 failed!
      • I reduced the batch size in this experiment to 128 and it got past the OOM.
    • I (re)tested this branch with the batch size lowered to 128 and to 32 for test_text_generation_fp8[token0-meta-llama/Llama-2-7b-hf and it still crashes with the same tensor_data is empty error.
    • We might need to dig into this a bit.
  • speech-recognition multi-card now passes its previous functional failures. Thanks!
  • The llava_next tests are now passing their previous functional failures. Thanks!
  • I have run the 5 previously failing tests that had trust_remote_code errors locally; 4 of them now pass, but bridgetower_bridgetower-large-itm-mlm-itc_multi_card.log is failing with the issue below. I will try to root-cause this.
[rank3]:     x = self.transforms(x)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1535, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1585, in _call_impl
[rank3]:     result = forward_call(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/container.py", line 217, in forward
[rank3]:     input = module(input)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1535, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1585, in _call_impl
[rank3]:     result = forward_call(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torchvision/transforms/transforms.py", line 354, in forward
[rank3]:     return F.resize(img, self.size, self.interpolation, self.max_size, self.antialias)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torchvision/transforms/functional.py", line 456, in resize
[rank3]:     _, image_height, image_width = get_dimensions(img)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torchvision/transforms/functional.py", line 80, in get_dimensions
[rank3]:     return F_pil.get_dimensions(img)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torchvision/transforms/_functional_pil.py", line 31, in get_dimensions
[rank3]:     raise TypeError(f"Unexpected type {type(img)}")
[rank3]: TypeError: Unexpected type <class 'NoneType'>
  • The DeepSpeed summarization task with flan-t5-xxl is failing the test since the eval metrics (rouge, gen_len) come out as 0. I checked this with the 1.16 driver / 1.16 container and got the same results. Is there something we need to update for this?
deepspeed --num_nodes 1 --num_gpus 8 run_summarization.py --model_name_or_path google/flan-t5-xxl --gaudi_config_name Habana/t5 --dataset_name cnn_dailymail --do_train --output_dir /tmp/tmpy1_wto8t --overwrite_output_dir --learning_rate 0.0001 --per_device_train_batch_size 22 --per_device_eval_batch_size 22 --num_train_epochs 2 --use_habana --throughput_warmup_steps 3 --save_strategy no --use_lazy_mode --do_eval --max_steps 10 --max_eval_samples 880 --dataset_config 3.0.0 --source_prefix summarize: --predict_with_generate --ignore_pad_token_for_loss False --pad_to_max_length --generation_max_length 129 --gradient_checkpointing --adam_epsilon 1e-08 --deepspeed ds_flan_t5_z3_config_bf16.json
.
.
***** eval metrics *****
  epoch                       =     0.0061
  eval_gen_len                =        0.0
  eval_loss                   =     0.9052
  eval_rouge1                 =        0.0 << here
  eval_rouge2                 =        0.0 << here
  eval_rougeL                 =        0.0  << here
  eval_rougeLsum              =        0.0 << here
  eval_runtime                = 0:03:16.03
  eval_samples                =        880
  eval_samples_per_second     =       4.73
  eval_steps_per_second       =      0.027
  max_memory_allocated (GB)   =      94.59
  memory_allocated (GB)       =       28.2
  total_memory_available (GB) =      94.62
    assert_function(
E   AssertionError: 0.0 not greater than or equal to 0.14147099999999999 : for metric eval_rougeLsum. 
E   ===== Assessed metrics (measured vs thresholded baseline) =====
E   eval_rougeLsum: 0.0 vs 0.14147099999999999
E   train_runtime: 92.3056 vs 93.9603
E   train_samples_per_second: 27.418 vs 25.93405
  • Thank you for pushing the fix for the deprecated XX_ARCHIVE_LIST variables. Now our transformers test suite, i.e. python -m pytest tests/transformers/tests/models -v -s, is the most populated one, with 100+ failures.
    Below are the categories of the failures. I understand those tests have not been updated yet, but would you like us to help with them? If so, could you point us to where to start finding the equivalents of the changes made in Upgrade to Transformers 4.40 #1027 (the 4.40 update)? Thank you.
    • has no attribute: 81
    • got an unexpected keyword argument 'token_idx' errors: 16
    • required positional arguments: 14
    • Not Implemented errors: 3
      Examples
FAILED tests/transformers/tests/models/bert/test_modeling_bert.py::BertModelTest::test_beam_search_generate - AttributeError: 'BertLMHeadModel' object has no attribute 'beam_search'. Di...
FAILED tests/transformers/tests/models/bert/test_modeling_bert.py::BertModelTest::test_beam_search_generate_dict_output - AttributeError: 'BertLMHeadModel' object has no attribute 'beam_search'. Di...
FAILED tests/transformers/tests/models/bert/test_modeling_bert.py::BertModelTest::test_beam_search_generate_dict_outputs_use_cache - AttributeError: 'BertLMHeadModel' object has no attribute 'beam_search'. Di...
FAILED tests/transformers/tests/models/bert/test_modeling_bert.py::BertModelTest::test_constrained_beam_search_generate - AttributeError: 'BertLMHeadModel' object has no attribute 'constrained_beam...
FAILED tests/transformers/tests/models/bert/test_modeling_bert.py::BertModelTest::test_constrained_beam_search_generate_dict_output - AttributeError: 'BertLMHeadModel' object has no attribute 'constrained_beam...
FAILED tests/transformers/tests/models/bert/test_modeling_bert.py::BertModelTest::test_contrastive_generate - TypeError: GenerationMixin._contrastive_search() missing 3 required positio...
FAILED tests/transformers/tests/models/bert/test_modeling_bert.py::BertModelTest::test_contrastive_generate_dict_outputs_use_cache - TypeError: GenerationMixin._contrastive_search() missing 3 required positio...
FAILED tests/transformers/tests/models/bert/test_modeling_bert.py::BertModelTest::test_greedy_generate - AttributeError: 'BertLMHeadModel' object has no attribute 'greedy_search'

@regisss (Collaborator, Author) commented Aug 2, 2024

@imangohari1

  • For BridgeTower, it seems the MediaPipe dataloader fails at some point because it's falling back to the Torch dataloader (which should also work...). Maybe this PR can help?
  • For Flan-T5, it's hard to say. It looks to me like a data type error; I'll check later.
  • For the Transformers tests, I think you can just compare with https://github.com/huggingface/transformers/tree/v4.43.3/tests/models. From what I see, many of the errors you get come from calling sample (or another decoding strategy) instead of _sample. And greedy_search doesn't exist anymore; it is a special case of sampling (see the upstream comment "# 13. run sample (it degenerates to greedy search when `generation_config.do_sample=False`)" and the sketch after this list).
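To illustrate the last point, a minimal sketch (hypothetical snippet using a small Hub model; the OH tests build their own tiny configs) of how decoding strategies are requested in Transformers 4.43, through generate rather than the removed greedy_search/beam_search methods:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # hypothetical model id for illustration
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Hello", return_tensors="pt")

# Greedy decoding is generate(do_sample=False); internally it runs _sample,
# which degenerates to greedy search when generation_config.do_sample=False.
greedy_out = model.generate(**inputs, do_sample=False, max_new_tokens=5)

# Beam search is likewise requested via generate, not model.beam_search(...).
beam_out = model.generate(**inputs, num_beams=2, do_sample=False, max_new_tokens=5)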

@regisss regisss merged commit 34c780e into main Aug 7, 2024
7 checks passed
@regisss regisss deleted the transformers_4.43 branch August 7, 2024 12:29
@regisss (Collaborator, Author) commented Aug 7, 2024

Merging as this PR is becoming too big. I'll open a new PR for Synapse 1.17 specific changes. Let's open new PRs for other fixes, including AFAIK:

  • Fix FSDP test, I'll do it
  • Fix Flan-T5 DeepSpeed test
  • There are issues with Llama 3.1

@regisss (Collaborator, Author) commented Aug 7, 2024

> 1. Measure the model on a number of cards that is enough for the model to fit in BF16.
> 2. Quantize the model on the same number of cards so that the scales are saved.
> 3. Run the unify_measurements.py script using the measurement files created in steps 1 and 2. A unified measurement is then calculated.
>
> Are these steps required for models that are already quantized to fp8, e.g. Llama 405B FP8? https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct-FP8
>
> I don't have two Gaudi boxes available, so I don't know if I should defer this to your team.

@endomorphosis I'm not sure the fp8 checkpoints are compatible with how we do fp8 quantization on Gaudi

@imangohari1 (Contributor) commented:

> Merging as this PR is becoming too big. I'll open a new PR for Synapse 1.17 specific changes. Let's open new PRs for other fixes, including AFAIK:
>
> * Fix FSDP test, I'll do it
> * Fix Flan-T5 DeepSpeed test
> * There are issues with Llama 3.1

For "Fix Flan-T5 DeepSpeed test": #1224

@regisss (Collaborator, Author) commented Aug 7, 2024

> For "Fix Flan-T5 DeepSpeed test": #1224

So we just needed to train the model a bit more?

@imangohari1 (Contributor) commented:

> So we just needed to train the model a bit more?

Yes.
@yeonsily cross-validated a longer training run with R&D results and they were in agreement.
Adding a few more steps here was sufficient to get past the failure.

check_min_version("4.40.0")
check_optimum_habana_min_version("1.11.0")
check_min_version("4.43.0")
check_optimum_habana_min_version("1.12.0")
@Sanatan-Shrivastava commented Aug 12, 2024

Hi, I am trying to run this test with versions (optimum-habana==1.12.1, transformers==4.43.0) but encountered the following error:

ERROR: Cannot install optimum-habana==1.12.1 and transformers==4.43.0 because these package versions have conflicting dependencies.
The conflict is caused by:
    The user requested transformers==4.43.0
    optimum-habana 1.12.1 depends on transformers<4.41.0 and >=4.40.0

I also tried with optimum-habana version 1.12.0 and encountered the same error.

Can someone please point me to a PR that corrects this, or to any documentation on how to fix it?
Thanks!


You should pip install optimum-habana from git, because this branch is ahead of 1.12.1 and has different dependencies; it should be listed as 1.13.0.dev as the version number.

@regisss (Collaborator, Author) commented:

Yeah it should be check_optimum_habana_min_version("1.13.0.dev0"), I'll add a script to do that automatically.
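A minimal sketch of what such a script could look like (hypothetical helper, not the actual script mentioned above; the target version would in practice be read from optimum/habana/version.py):

import re
from pathlib import Path

NEW_VERSION = "1.13.0.dev0"  # hypothetical hard-coded value for illustration
PATTERN = re.compile(r'check_optimum_habana_min_version\("[^"]*"\)')

def bump_min_version(examples_dir: str = "examples") -> None:
    """Rewrite check_optimum_habana_min_version(...) calls in every example script."""
    for path in Path(examples_dir).rglob("*.py"):
        text = path.read_text()
        updated = PATTERN.sub(f'check_optimum_habana_min_version("{NEW_VERSION}")', text)
        if updated != text:
            path.write_text(updated)
            print(f"Updated {path}")

if __name__ == "__main__":
    bump_min_version()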

> You should pip install optimum-habana from git, because this branch is ahead of 1.12.1 and has different dependencies; it should be listed as 1.13.0.dev as the version number.

OK great, thanks for the info. Currently I am using pip install git+https://github.com/huggingface/optimum-habana.git@{{ optimum_habana_version }}. Would using pip install git+https://github.com/huggingface/optimum-habana.git to get the latest version compatible with transformers==4.43.0 also work for future Transformers versions?

@regisss (Collaborator, Author) commented:

It may not work with future Transformers versions as there might be changes that are not compatible with what we do here in Optimum Habana.
In the coming weeks, I will open a new branch and try to maintain it so that new Transformers releases are supported but with potential perf regressions (which will be solved once it comes to the main branch).

@Sanatan-Shrivastava commented Aug 13, 2024

Thanks @regisss, that would be helpful.
Until then, I will be using transformers==4.43.0 while doing pip install git+https://github.com/huggingface/optimum-habana.git. It worked for me yesterday without specifying the optimum-habana version.
Can you please confirm whether this will keep working for 4.43.0?
For future Transformers versions, I'll keep an eye out for the new branch you mentioned.

@regisss (Collaborator, Author) commented:

Yes, the main branch should work for Transformers 4.43.x until the next time we align the library with a new version of Transformers.
