How can we use the BertForTokenClassification model for lengthy sentences? #77
Comments
Hi @Swty13, the pretrained model's embedding weights have been set to 512. You can try changing the relevant line in run_ner.py (line 216) and in bert.py (line 79). I am having the same issue; I will try to find a solution, and if you find one, please let me know. Best
Hi, (Currently I have a VM server with 32 GB and 64 GB of RAM. What configuration should I choose, or is a GPU a must for training a BERT model? I am new to BERT, so I have no idea about it.) Thanks
@Swty13
For your second question: you can always train your model on CPU with a 1 lakh (100,000) dataset (I am assuming sentences) with the said RAM.
There is no way to do it without splitting the input. That is because Google released the pretrained version of BERT with the 512-token limitation, and to remove that limitation you would basically have to pretrain BERT from scratch, which is unfeasible and would cost a lot of money. I solved it by splitting the input into chunks of fewer than 500 tokens each, always splitting at the closest period.
Edit: I use 500 instead of 512 just because when I use 512 I sometimes still get an error for some reason, possibly because of additional tokens added by the model itself (likely the special [CLS] and [SEP] tokens).
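The splitting approach described above can be sketched roughly as follows. This is a minimal illustration, not code from this repo: it counts whitespace-separated words as a stand-in for BERT's word-piece tokens (a real implementation should count with the BERT tokenizer, which generally produces more pieces than words), and it assumes no single sentence exceeds the limit on its own.

```python
def split_at_periods(text, max_tokens=500):
    """Split `text` into chunks of at most `max_tokens` tokens,
    always breaking at the period closest to the limit.

    Note: `len(sent.split())` is a crude proxy for the word-piece
    token count; swap in the BERT tokenizer's count for real use.
    """
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    chunks, current, length = [], [], 0
    for sent in sentences:
        n = len(sent.split())  # proxy for token count of this sentence
        if current and length + n > max_tokens:
            # Adding this sentence would exceed the budget:
            # close the current chunk at the previous period.
            chunks.append(". ".join(current) + ".")
            current, length = [], 0
        current.append(sent)
        length += n
    if current:
        chunks.append(". ".join(current) + ".")
    return chunks
```

Each chunk can then be run through the model separately and the per-token predictions concatenated; using a budget comfortably below 512 leaves room for the special tokens the tokenizer adds.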
Hi,
BERT tokenization only supports sequences of up to 512 tokens, so if my text length is greater than 512, how can I proceed?
I used BertForTokenClassification for an entity recognition task, but because my text is long, a warning comes up: "Token indices sequence length is longer than the specified maximum sequence length for this BERT model (527 > 512). Running this sequence through BERT will result in indexing errors".
I don't want to trim or truncate my text, as that loses important information; I have to pass my whole text.
Could you please suggest what I should do, or do you have any other idea for implementing named entity recognition?
Thanks in advance.