Error running BERT tagger on CoLi servers #4

Open
siyutao opened this issue Feb 14, 2022 · 16 comments
Labels
bug (Something isn't working) · wontfix (This will not be worked on)

Comments

@siyutao
Contributor

siyutao commented Feb 14, 2022

Currently getting an error while running the allennlp 0.8 BERT config tagger/tagger_with_bert_config.json after changing label_encoding to "BIO" ("BIOUL" throws a different error).
Error output:

Traceback (most recent call last):
  File "/proj/irtg.shadow/conda/envs/allennlp/bin/allennlp", line 10, in <module>
    sys.exit(run())
  File "/proj/irtg.shadow/conda/envs/allennlp/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/proj/irtg.shadow/conda/envs/allennlp/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 102, in main
    args.func(args)
  File "/proj/irtg.shadow/conda/envs/allennlp/lib/python3.7/site-packages/allennlp/commands/train.py", line 116, in train_model_from_args
    args.cache_prefix)
  File "/proj/irtg.shadow/conda/envs/allennlp/lib/python3.7/site-packages/allennlp/commands/train.py", line 160, in train_model_from_file
    cache_directory, cache_prefix)
  File "/proj/irtg.shadow/conda/envs/allennlp/lib/python3.7/site-packages/allennlp/commands/train.py", line 243, in train_model
    metrics = trainer.train()
  File "/proj/irtg.shadow/conda/envs/allennlp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 480, in train
    train_metrics = self._train_epoch(epoch)
  File "/proj/irtg.shadow/conda/envs/allennlp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 322, in _train_epoch
    loss = self.batch_loss(batch_group, for_training=True)
  File "/proj/irtg.shadow/conda/envs/allennlp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 263, in batch_loss
    output_dict = self.model(**batch)
  File "/proj/irtg.shadow/conda/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/proj/irtg.shadow/conda/envs/allennlp/lib/python3.7/site-packages/allennlp/models/crf_tagger.py", line 182, in forward
    embedded_text_input = self.text_field_embedder(tokens)
  File "/proj/irtg.shadow/conda/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/proj/irtg.shadow/conda/envs/allennlp/lib/python3.7/site-packages/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 125, in forward
    return torch.cat(embedded_representations, dim=-1)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 2. Got 433 and 422 in dimension 1 at /pytorch/aten/src/THC/generic/THCTensorMath.cu:71

Another TO-DO: we need to add min_padding_length to the config. It may or may not be related to the current error.

/proj/irtg.shadow/conda/envs/allennlp/lib/python3.7/site-packages/allennlp/data/token_indexers/token_characters_indexer.py:55: UserWarning: You are using the default value (0) of `min_padding_length`, which can cause some subtle bugs (more info see https://github.com/allenai/allennlp/issues/1954). Strongly recommend to set a value, usually the maximum size of the convolutional layer size when using CnnEncoder.
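For reference, a minimal sketch of what the warning asks for, shown here as a Python dict rather than the actual JSON fragment; the value 3 is a placeholder that should match the largest n-gram filter size of the CnnEncoder in tagger/tagger_with_bert_config.json:

# Hypothetical token_characters indexer fragment, for illustration only;
# in the repo this setting would live in tagger/tagger_with_bert_config.json.
token_indexers = {
    "token_characters": {
        "type": "characters",
        "min_padding_length": 3,  # placeholder: use the largest CNN filter size
    },
}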
siyutao added the bug (Something isn't working) and help wanted (Extra attention is needed) labels on Feb 14, 2022
siyutao closed this as completed on Feb 15, 2022
siyutao reopened this on Feb 15, 2022
@irisferrazzo
Contributor

Hi @siyutao, I've read that you have many exams. Should we say that you continue working on the allennlp 2.8 ELMo tagger, while @TheresaSchmidt and I meet as soon as possible, run the allennlp 0.8 BERT tagger, and discuss debugging? Let me know what you think.

@siyutao
Contributor Author

siyutao commented Feb 15, 2022

Hey @irisferrazzo, that sounds good to me if you and Theresa have time this week. Otherwise I can spend some time on this issue this weekend too (though I think we decided moving ELMo to 2.8 is the priority?). I'll be a lot freer from the 22nd. Thanks!

@irisferrazzo
Contributor

irisferrazzo commented Feb 15, 2022

Hi @siyutao, yes, you're right, but I want to respect the fact that you have many exams :) Let's see whether @TheresaSchmidt has time this week or not. If not, and you have some time this week/weekend to run and debug together instead of working on moving ELMo to 2.8, that would obviously be better! Just don't want to put pressure on anybody :)

@TheresaSchmidt
Contributor

This is really not the error I would expect from changing label_encoding. I would suspect an underlying issue that causes the different errors for BIOUL and BIO, respectively. But I'm really just guessing, too.
I'll have a look, but this looks very much like the type of error that I got stuck on before, i.e. when I was working on joint learning.

Also, this week I'm still pretty busy but next week should be better.

@TheresaSchmidt
Contributor

We've narrowed down the issue to the data. Somehow, with part of the data, the training runs through just fine (I tried with the German data and with cropped versions of the English data), but the full English data triggers the above error.

I did a superficial search for white-space irregularities (I have had problems with that before) but couldn't find anything. We could also try to look for gaps in the data. Maybe there's a line where not all columns are filled.
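As a rough sketch of such a check (the tab separator and the expected column count are assumptions about the CoNLL-style files):

# Hypothetical sanity check: flag lines with an unexpected number of columns
# or trailing whitespace; blank lines are treated as sentence/recipe breaks.
def check_columns(path, expected_cols):
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            stripped = line.rstrip("\n")
            if not stripped.strip():
                continue  # blank line = block separator
            cols = stripped.split("\t")
            if len(cols) != expected_cols:
                print(f"line {lineno}: {len(cols)} columns: {stripped!r}")
            if stripped != stripped.rstrip():
                print(f"line {lineno}: trailing whitespace")

# expected_cols=10 is a guess; set it to the real column count of the data
check_columns("train_1222211.txt", expected_cols=10)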

@TheresaSchmidt
Contributor

We could also try to look for gaps in the data. Maybe there's a line where not all columns are filled.

Haven't found anything.

@TheresaSchmidt
Contributor

If I use the attached file as training data, I get a dimension error (like the one above but with different numbers). If I split up the file into two separate files, each of the two files trains successfully.

train_1222211.txt
This file contains one recipe from the English training data.

@irisferrazzo
Contributor

This last file actually has a trailing blank line. It works regardless, right? I can have a look at the data now. I'll let you know if I find something.

@irisferrazzo
Contributor

Yesterday I tried to run the ELMo tagger, but I still don't have access to proj/cookbook (for the ELMo weights etc., which I would prefer not to download). Could you also run it on the same data if you get to it? Then we can double-check.

@TheresaSchmidt
Contributor

Here's a technical explanation of why it's not working: allenai/allennlp#2851

However, this does not explain why it used to run through without a problem and suddenly doesn't do so anymore even though we haven't changed anything...
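If the cause is the one described in that issue, i.e. recipes whose wordpiece count exceeds BERT's 512-piece limit so the BERT representation ends up with a different length than the other token representations being concatenated, a rough way to spot the offending sequences is sketched below. The tokenizer comes from pytorch-pretrained-bert, which allennlp 0.8 depends on; the tab-separated format and token-in-first-column layout are assumptions about the data:

# Rough sketch: count wordpieces per blank-line-separated block and flag
# anything over BERT's 512-piece limit.
from pytorch_pretrained_bert import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

def wordpiece_lengths(path):
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line.strip():
                if sentence:
                    pieces = sum(len(tokenizer.tokenize(tok)) for tok in sentence)
                    yield len(sentence), pieces
                    sentence = []
            else:
                sentence.append(line.split("\t")[0])  # assumption: token in first column
    if sentence:
        yield len(sentence), sum(len(tokenizer.tokenize(tok)) for tok in sentence)

for n_tokens, n_pieces in wordpiece_lengths("train_1222211.txt"):
    if n_pieces > 512:
        print(f"{n_tokens} tokens -> {n_pieces} wordpieces (over BERT's limit)")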

@TheresaSchmidt
Contributor

Yesterday I tried to run the ELMo tagger, but I still don't have access to proj/cookbook (for the ELMo weights etc., which I would prefer not to download). Could you also run it on the same data if you get to it? Then we can double-check.

Training with ELMo runs as expected. This confirms that the problem is with the tokenization in BERT.

@irisferrazzo
Contributor

It seems like we need to change the way BERT embeds the recipes. The most quoted solution is the addition of a sliding window: allenai/allennlp#2537
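For illustration only (this is not AllenNLP's actual implementation), the sliding-window idea amounts to splitting a long wordpiece sequence into overlapping chunks of at most 512 pieces, embedding each chunk, and recombining the results. The sketch below shows just the chunking step; the window size and stride are placeholder values:

# Hypothetical sketch of sliding-window chunking over wordpiece IDs.
def sliding_windows(wordpiece_ids, max_pieces=512, stride=256):
    # Collect overlapping windows of at most max_pieces, advancing by `stride`.
    windows = []
    start = 0
    while True:
        windows.append(wordpiece_ids[start:start + max_pieces])
        if start + max_pieces >= len(wordpiece_ids):
            break
        start += stride
    return windows

# e.g. a 1000-piece sequence yields windows covering 0-511, 256-767, 512-999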

@TheresaSchmidt
Contributor

It seems like we need to change the way BERT embeds the recipes. The most quoted solution is the addition of a sliding window: allenai/allennlp#2537

a) The feature for sliding windows only came after allennlp 0.8, right? So it would probably be quite an effort to implement it.
b) We've prioritized moving ELMo to 2.8.
Therefore I suggest letting this issue be (for now) and keeping it in mind, because I would expect the same error with BERT in 2.8.

@siyutao
Contributor Author

siyutao commented Feb 23, 2022

only came after allennlp 0.8

Isn't the current implementation on allennlp 0.8.4? According to the release notes, 0.8.4 happens to be the release that added #2537. I already re-implemented the BERT tagger in 2.8 and there wasn't an error with this, but there was a problem replicating the previously reported results, as we've talked about.

But agreed that we should prioritize moving ELMo.

@TheresaSchmidt
Contributor

Ah ok. Then let's just postpone this, I think.

@TheresaSchmidt
Contributor

We are sure that we haven't changed the data or the configuration of the model. This leaves only two possible factors that could have changed such that training the model doesn't work anymore (correct me if I missed something):

  1. The environment at /proj/irtg.shadow/conda/envs/allennlp might have been updated.
  2. The pre-trained bert-base-multilingual-cased could have changed. It seems the model is updated regularly. In general, it is downloaded once and then a locally stored version is used each time you're training a new model. However, it is possible that the local version either gets updated sometimes and/or that it was lost at some point in time and a newer version was downloaded instead (a way to snapshot both factors for later comparison is sketched below).
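For illustration only, a rough sketch of how one might record the state of both factors so a later failing run can be compared against a known-good one. The cache location is an assumption (pytorch_pretrained_bert typically caches under ~/.pytorch_pretrained_bert, but the environment may be configured differently):

import hashlib
import pathlib
import subprocess

def snapshot(cache_dir, out_file="environment_snapshot.txt"):
    # Record installed package versions and a hash of every file in the local
    # BERT cache, so future runs can be diffed against this snapshot.
    with open(out_file, "w", encoding="utf-8") as out:
        freeze = subprocess.run(["pip", "freeze"], capture_output=True, text=True)
        out.write(freeze.stdout)
        for path in sorted(pathlib.Path(cache_dir).rglob("*")):
            if path.is_file():
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                out.write(f"{digest}  {path}\n")

# assumed cache location; adjust if PYTORCH_PRETRAINED_BERT_CACHE is set differently
# snapshot(pathlib.Path.home() / ".pytorch_pretrained_bert")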

irisferrazzo added the wontfix (This will not be worked on) label and removed the help wanted (Extra attention is needed) label on Jun 28, 2022