Upgrade to Transformers 4.43 #1163
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I tested this PR with the text-generation example using this command with the 1.16.2 Docker image; it fails with this log:
@avbodas It should work with the latest commit I pushed.
Hi @regisss
I added some comments below based on the CI tests that were run against this PR at
commit 49d877437cdc6c6c6ecc72bd496495fdc135acc6 (HEAD -> transformers_4.43, origin/transformers_4.43)
Author: regisss <15324346+regisss@users.noreply.github.com>
Date: Fri Jul 26 22:32:39 2024 +0000
Fixes
They might not all be relevant or useful, but I was wondering about your thoughts.
Other observations:
- We see this error related to PEFT, but the installed version is peft 0.12.0 (satisfying >=0.10.0). Maybe 0.12.0 is not compatible?
[rank5]: File "/root/optimum-habana/examples/language-modeling/run_lora_clm.py", line 841, in main
[rank5]: trainer.accelerator.state.fsdp_plugin.auto_wrap_policy = fsdp_auto_wrap_policy(lora_model)
[rank5]: File "/usr/local/lib/python3.10/dist-packages/peft/utils/other.py", line 396, in fsdp_auto_wrap_policy
[rank5]: transformer_cls = FullyShardedDataParallelPlugin.get_module_class_from_name(model, layer_class)
[rank5]: AttributeError: type object 'FullyShardedDataParallelPlugin' has no attribute 'get_module_class_from_name'
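For reference, a minimal workaround sketch (an assumption-laden illustration, not the fix that went into this PR): reimplement the helper that peft 0.12 expects to find on FullyShardedDataParallelPlugin and build the auto-wrap policy from it directly.

import functools

from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy


def get_module_class_from_name(module, name):
    # Recursively search the model's children for a class whose name matches,
    # mirroring what the removed staticmethod used to do.
    if module.__class__.__name__ == name:
        return module.__class__
    for child in module.children():
        found = get_module_class_from_name(child, name)
        if found is not None:
            return found
    return None


def simple_fsdp_auto_wrap_policy(model, layer_names=("LlamaDecoderLayer",)):
    # layer_names is a hypothetical default used only for illustration.
    layer_classes = {get_module_class_from_name(model, name) for name in layer_names}
    layer_classes.discard(None)
    return functools.partial(transformer_auto_wrap_policy, transformer_layer_cls=layer_classes)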
- For BridgeTower, we see this issue with remote data access. Does this need a fix too?
[rank3]: File "/root/optimum-habana/examples/contrastive-image-text/run_bridgetower.py", line 625, in <module>
[rank3]: main()
[rank3]: File "/root/optimum-habana/examples/contrastive-image-text/run_bridgetower.py", line 323, in main
[rank3]: dataset = load_dataset(
[rank3]: File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 2594, in load_dataset
[rank3]: builder_instance = load_dataset_builder(
[rank3]: File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 2266, in load_dataset_builder
[rank3]: dataset_module = dataset_module_factory(
[rank3]: File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 1914, in dataset_module_factory
[rank3]: raise e1 from None
[rank3]: File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 1887, in dataset_module_factory
[rank3]: ).get_module()
[rank3]: File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 1538, in get_module
[rank3]: raise ValueError(
[rank3]: ValueError: Loading jmhessel/newyorker_caption_contest requires you to execute the dataset script in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option `trust_remote_code=True` to remove this error.
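For reference, a minimal sketch of the kind of change that removes this error, assuming the dataset script has been reviewed; the config name is only illustrative and the actual fix in run_bridgetower.py may differ.

from datasets import load_dataset

# Hedged sketch: opt in to running the reviewed dataset script.
# "matching" is a hypothetical config name used only for illustration.
dataset = load_dataset(
    "jmhessel/newyorker_caption_contest",
    "matching",
    trust_remote_code=True,
)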
optimum/habana/transformers/models/gpt_neox/modeling_gpt_neox.py
@regisss , I see failures for ERROR tests/transformers/tests/models/albert/test_modeling_albert.py. In my debugging I found the following: each model has an error like ImportError: cannot import name 'VIT_PRETRAINED_MODEL_ARCHIVE_LIST' from 'transformers.models.vit.modeling_vit' (/usr/local/lib/python3.10/dist-packages/transformers/models/vit/modeling_vit.py), which seems related to the upstream transformers change where those deprecated variables finally got removed. Per the transformers tests, for example:

+++ b/tests/transformers/tests/models/vit/test_modeling_vit.py
@@ -37,7 +37,6 @@ if is_torch_available():
     import torch
     from torch import nn
     from transformers import ViTForImageClassification, ViTForMaskedImageModeling, ViTModel
-    from transformers.models.vit.modeling_vit import VIT_PRETRAINED_MODEL_ARCHIVE_LIST
 if is_vision_available():
@@ -245,9 +244,9 @@ class ViTModelTest(ModelTesterMixin, unittest.TestCase):
     @slow
     def test_model_from_pretrained(self):
-        for model_name in VIT_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
-            model = ViTModel.from_pretrained(model_name)
-            self.assertIsNotNone(model)
+        model_name = "google/vit-base-patch16-224"
+        model = ViTModel.from_pretrained(model_name)
+        self.assertIsNotNone(model)

I tested that it gets past the error and runs the tests.
@imangohari1 It's solved for PEFT and BridgeTower. I'll have to check that the other examples with PEFT work with v0.12. @vidyasiv I added the changes for all these tests.
Thank you! I have started doing some subset testing and will do another integrated CI job. Attached is 0001-fea-diffuser-tests-Updated-the-tests-for-4.43.patch; please review it and apply it for the failing diffuser cases.
I updated two of them from here: https://github.com/huggingface/diffusers/blob/main/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl.py#L181-L196. The other 3 are our own tests without any upstream equivalent.

diff --git a/tests/test_diffusers.py b/tests/test_diffusers.py
index 1def508..b4c0a92 100755
--- a/tests/test_diffusers.py
+++ b/tests/test_diffusers.py
@@ -42,6 +42,9 @@ from diffusers import (
DiffusionPipeline,
DPMSolverMultistepScheduler,
EulerDiscreteScheduler,
+ EulerAncestralDiscreteScheduler,
+ StableDiffusionXLPipeline,
+ StableVideoDiffusionPipeline,
LCMScheduler,
PNDMScheduler,
UNet2DConditionModel,
@@ -958,7 +961,8 @@ class GaudiStableDiffusionXLPipelineTester(TestCase):
device = "cpu" # ensure determinism for the device-dependent torch.Generator
components = self.get_dummy_components()
gaudi_config = GaudiConfig(use_torch_autocast=False)
- sd_pipe = GaudiStableDiffusionXLPipeline(use_habana=True, gaudi_config=gaudi_config, **components)
+ sd_pipe = StableDiffusionXLPipeline( **components)
+ #sd_pipe = GaudiStableDiffusionXLPipeline(use_habana=True, gaudi_config=gaudi_config, **components)
sd_pipe.set_progress_bar_config(disable=None)
inputs = self.get_dummy_inputs(device)
@@ -977,8 +981,10 @@ class GaudiStableDiffusionXLPipelineTester(TestCase):
device = "cpu" # ensure determinism for the device-dependent torch.Generator
components = self.get_dummy_components()
gaudi_config = GaudiConfig(use_torch_autocast=False)
- sd_pipe = GaudiStableDiffusionXLPipeline(use_habana=True, gaudi_config=gaudi_config, **components)
- sd_pipe.scheduler = GaudiEulerAncestralDiscreteScheduler.from_config(sd_pipe.scheduler.config)
+ #sd_pipe = GaudiStableDiffusionXLPipeline(use_habana=True, gaudi_config=gaudi_config, **components)
+ #sd_pipe.scheduler = GaudiEulerAncestralDiscreteScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = StableDiffusionXLPipeline(**components)
+ sd_pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(sd_pipe.scheduler.config)
sd_pipe.set_progress_bar_config(disable=None)
inputs = self.get_dummy_inputs(device)
@@ -2196,7 +2202,8 @@ class GaudiStableVideoDiffusionPipelineTester(TestCase):
device = "cpu" # ensure determinism for the device-dependent torch.Generator
components = self.get_dummy_components()
gaudi_config = GaudiConfig(use_torch_autocast=False)
- sd_pipe = GaudiStableVideoDiffusionPipeline(use_habana=True, gaudi_config=gaudi_config, **components)
+ sd_pipe = StableVideoDiffusionPipeline( **components)
+ #sd_pipe = GaudiStableVideoDiffusionPipeline(use_habana=True, gaudi_config=gaudi_config, **components)
for component in sd_pipe.components.values():
if hasattr(component, "set_default_attn_processor"):
component.set_default_attn_processor()

I talked to both @dsocek and @skavulya. We might want to consider replacing these with some upstream tests, or automating them to get the expected values during the runtime. The latter would increase the runtime. Results (with patch):
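Separately, to illustrate the "get the expected values during the runtime" option mentioned above, here is a hedged sketch that assumes the tests' get_dummy_components()/get_dummy_inputs() fixtures and numpy image outputs (not what was merged):

from diffusers import EulerAncestralDiscreteScheduler, StableDiffusionXLPipeline


def reference_image_slice(components, inputs):
    # Run the upstream CPU pipeline once per test to produce a reference slice,
    # instead of hard-coding expected values for the Gaudi pipeline to match.
    ref_pipe = StableDiffusionXLPipeline(**components)
    ref_pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(ref_pipe.scheduler.config)
    ref_pipe.set_progress_bar_config(disable=None)
    image = ref_pipe(**inputs).images[0]
    return image[-3:, -3:, -1].flatten()

The Gaudi test could then assert that its own output slice is close to this reference (for example with numpy.testing.assert_allclose), at the cost of running the reference pipeline during each test.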
Hi @regisss
Thanks for adding the remote trust to the tests.
Minor fixes for whisper and wav2vec as well in the attached patch; I tested them and they are working.
0002-fea-added-trust-remote-for-whisper-and-wav2vec2.patch
I have also seen this repeated error on llava_next that I would like your input on.
optimum/habana/transformers/models/llava_next/modeling_llava_next.py
@yafshar Done! @imangohari1 I added the patches and solved the issue for Llava-next. Don't hesitate to open PRs to merge your patches into this branch if you have other changes to suggest 🙂
@vivekgoe Regarding the Transformers 4.43 PR, the BERT FSDP test now fails because the trainer tries to save the FSDP model at the end of training. That didn't happen with Transformers 4.40, but we should still be able to save the FSDP model.
The issue is that fsdp_modules returns an empty list. Any idea why?
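One way to narrow this down (a hedged diagnostic sketch; how the trainer exposes the model here is an assumption): list what FSDP actually wrapped.

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def describe_fsdp_wrapping(model):
    # FSDP.fsdp_modules returns every FSDP-wrapped submodule of `model`;
    # an empty list means nothing was wrapped, so there is no sharded state to save.
    wrapped = FSDP.fsdp_modules(model)
    print(f"Found {len(wrapped)} FSDP-wrapped submodules")
    for fsdp_module in wrapped:
        print(type(fsdp_module.module).__name__)
    return wrapped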
@imangohari1 I see that the Llama fp8 text-generation tests failed with:
Do you get the same?
Yes, we get the same error.
Hi @regisss, here are my comments/questions/updates from today's tests:
[rank3]: x = self.transforms(x)
[rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1535, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1585, in _call_impl
[rank3]: result = forward_call(*args, **kwargs)
[rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/container.py", line 217, in forward
[rank3]: input = module(input)
[rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1535, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1585, in _call_impl
[rank3]: result = forward_call(*args, **kwargs)
[rank3]: File "/usr/local/lib/python3.10/dist-packages/torchvision/transforms/transforms.py", line 354, in forward
[rank3]: return F.resize(img, self.size, self.interpolation, self.max_size, self.antialias)
[rank3]: File "/usr/local/lib/python3.10/dist-packages/torchvision/transforms/functional.py", line 456, in resize
[rank3]: _, image_height, image_width = get_dimensions(img)
[rank3]: File "/usr/local/lib/python3.10/dist-packages/torchvision/transforms/functional.py", line 80, in get_dimensions
[rank3]: return F_pil.get_dimensions(img)
[rank3]: File "/usr/local/lib/python3.10/dist-packages/torchvision/transforms/_functional_pil.py", line 31, in get_dimensions
[rank3]: raise TypeError(f"Unexpected type {type(img)}")
[rank3]: TypeError: Unexpected type <class 'NoneType'>
deepspeed --num_nodes 1 --num_gpus 8 run_summarization.py --model_name_or_path google/flan-t5-xxl --gaudi_config_name Habana/t5 --dataset_name cnn_dailymail --do_train --output_dir /tmp/tmpy1_wto8t --overwrite_output_dir --learning_rate 0.0001 --per_device_train_batch_size 22 --per_device_eval_batch_size 22 --num_train_epochs 2 --use_habana --throughput_warmup_steps 3 --save_strategy no --use_lazy_mode --do_eval --max_steps 10 --max_eval_samples 880 --dataset_config 3.0.0 --source_prefix summarize: --predict_with_generate --ignore_pad_token_for_loss False --pad_to_max_length --generation_max_length 129 --gradient_checkpointing --adam_epsilon 1e-08 --deepspeed ds_flan_t5_z3_config_bf16.json
.
.
***** eval metrics *****
epoch = 0.0061
eval_gen_len = 0.0
eval_loss = 0.9052
eval_rouge1 = 0.0 << here
eval_rouge2 = 0.0 << here
eval_rougeL = 0.0 << here
eval_rougeLsum = 0.0 << here
eval_runtime = 0:03:16.03
eval_samples = 880
eval_samples_per_second = 4.73
eval_steps_per_second = 0.027
max_memory_allocated (GB) = 94.59
memory_allocated (GB) = 28.2
total_memory_available (GB) = 94.62

assert_function(
E AssertionError: 0.0 not greater than or equal to 0.14147099999999999 : for metric eval_rougeLsum.
E ===== Assessed metrics (measured vs thresholded baseline) =====
E eval_rougeLsum: 0.0 vs 0.14147099999999999
E train_runtime: 92.3056 vs 93.9603
E train_samples_per_second: 27.418 vs 25.93405
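For context, a rough sketch of the kind of baseline comparison behind this failure (the actual test harness in this repo may differ; shown only for higher-is-better metrics):

def assert_metrics_meet_baseline(measured, baseline, rel_tol=0.02):
    # Hedged sketch: compare each measured metric against its stored baseline
    # scaled by a tolerance; eval_rougeLsum = 0.0 falls far below its threshold.
    for name, expected in baseline.items():
        threshold = expected * (1 - rel_tol)
        assert measured[name] >= threshold, (
            f"{measured[name]} not greater than or equal to {threshold} : for metric {name}."
        )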
FAILED tests/transformers/tests/models/bert/test_modeling_bert.py::BertModelTest::test_beam_search_generate - AttributeError: 'BertLMHeadModel' object has no attribute 'beam_search'. Di...
FAILED tests/transformers/tests/models/bert/test_modeling_bert.py::BertModelTest::test_beam_search_generate_dict_output - AttributeError: 'BertLMHeadModel' object has no attribute 'beam_search'. Di...
FAILED tests/transformers/tests/models/bert/test_modeling_bert.py::BertModelTest::test_beam_search_generate_dict_outputs_use_cache - AttributeError: 'BertLMHeadModel' object has no attribute 'beam_search'. Di...
FAILED tests/transformers/tests/models/bert/test_modeling_bert.py::BertModelTest::test_constrained_beam_search_generate - AttributeError: 'BertLMHeadModel' object has no attribute 'constrained_beam...
FAILED tests/transformers/tests/models/bert/test_modeling_bert.py::BertModelTest::test_constrained_beam_search_generate_dict_output - AttributeError: 'BertLMHeadModel' object has no attribute 'constrained_beam...
FAILED tests/transformers/tests/models/bert/test_modeling_bert.py::BertModelTest::test_contrastive_generate - TypeError: GenerationMixin._contrastive_search() missing 3 required positio...
FAILED tests/transformers/tests/models/bert/test_modeling_bert.py::BertModelTest::test_contrastive_generate_dict_outputs_use_cache - TypeError: GenerationMixin._contrastive_search() missing 3 required positio...
FAILED tests/transformers/tests/models/bert/test_modeling_bert.py::BertModelTest::test_greedy_generate - AttributeError: 'BertLMHeadModel' object has no attribute 'greedy_search'
Merging as this PR is becoming too big. I'll open a new PR for Synapse 1.17-specific changes. Let's open new PRs for other fixes, including (AFAIK):
@endomorphosis I'm not sure the fp8 checkpoints are compatible with how we do fp8 quantization on Gaudi |
So we just needed to train the model a bit more? |
Yes.
check_min_version("4.40.0") | ||
check_optimum_habana_min_version("1.11.0") | ||
check_min_version("4.43.0") | ||
check_optimum_habana_min_version("1.12.0") |
Hi, I am trying to run this test with versions (optimum-habana==1.12.1, transformers==4.43.0) but encountered the following error:
ERROR: Cannot install optimum-habana==1.12.1 and transformers==4.43.0 because these package versions have conflicting dependencies.
The conflict is caused by:
    The user requested transformers==4.43.0
    optimum-habana 1.12.1 depends on transformers<4.41.0 and >=4.40.0
I also tried with optimum-habana version 1.12.0 and encountered the same error.
Can someone please point me to the correct PR, or to any documentation on how to fix this?
Thanks!
You should pip install optimum-habana from git, because this fork is ahead of 1.12.1 and has different dependencies; it should be listed as 1.13.0.dev as the version number.
Yeah, it should be check_optimum_habana_min_version("1.13.0.dev0"). I'll add a script to do that automatically.
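A hypothetical sketch of such a script (not the one that was eventually added; the examples path and version strings are assumptions):

import re
from pathlib import Path

NEW_TRANSFORMERS_MIN = "4.43.0"
NEW_OPTIMUM_HABANA_MIN = "1.13.0.dev0"

for path in Path("examples").rglob("*.py"):
    text = path.read_text()
    # Rewrite both min-version guards wherever they appear in the example scripts.
    updated = re.sub(
        r'check_min_version\("[^"]*"\)',
        f'check_min_version("{NEW_TRANSFORMERS_MIN}")',
        text,
    )
    updated = re.sub(
        r'check_optimum_habana_min_version\("[^"]*"\)',
        f'check_optimum_habana_min_version("{NEW_OPTIMUM_HABANA_MIN}")',
        updated,
    )
    if updated != text:
        path.write_text(updated)
        print(f"Updated {path}")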
> You should pip install optimum-habana from git, because this fork is ahead of 1.12.1 and has different dependencies; it should be listed as 1.13.0.dev as the version number.
Ok great, thanks for the info. Currently I am using pip install git+https://github.com/huggingface/optimum-habana.git@{{ optimum_habana_version }}. Would using pip install git+https://github.com/huggingface/optimum-habana.git to get the latest version compatible with transformers==4.43.0 also work for future transformers versions?
It may not work with future Transformers versions as there might be changes that are not compatible with what we do here in Optimum Habana.
In the coming weeks, I will open a new branch and try to maintain it so that new Transformers releases are supported but with potential perf regressions (which will be solved once it comes to the main branch).
Thanks @regisss, that would be helpful.
Until then, I will be using transformers==4.43.0 while doing pip install git+https://github.com/huggingface/optimum-habana.git. It worked for me yesterday without specifying the optimum-habana version.
Can you please confirm whether this would continue working for 4.43.0 only?
For future transformers versions, I'll keep an eye out for the new branch you mentioned.
Yes, the main branch should work for Transformers 4.43.x till the next time we align the lib with a new version of Transformers.
What does this PR do?
As per title.