I have a large number of texts that I want to run through the Python examples. My question: does the annotated data I provide have to be split up into sentences? Or can I pass a whole text with multiple sentences, paragraphs, and lines, so that the ranges of each entity are token positions within the whole document rather than within each sentence separately?
Based on the comment in the code (for example in train_ner.py):
"When you train a named_entity_extractor you need to get a dataset of sentences (or sentence or paragraph length chunks of text) where each sentence is annotated with the entities you want to find." I would like clarification on the "paragraph length chunks of text" part.
Thank you in advance.
Thank you for your reply. Does this format affect the execution time?
I am struggling to run this: it takes more than an hour on just a couple of annotated files, so I am not able to add more (ideally I would like to train on around 1500 annotation files).
I am running this on an Ubuntu VM with a 2-core CPU (initializing 2 threads). If you need more information, please let me know.
There is a certain fixed runtime that is a function of how difficult your data is. This fixed runtime can be large if your dataset is labeled in an inconsistent or difficult way, but adding more data shouldn't be a problem in terms of runtime. This all assumes your machine isn't using a 10-year-old CPU, running out of RAM, constantly swapping to disk, or something like that. It's possible that your computer just sucks. But other than that you shouldn't worry about it.
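As an aside, if you want to make sure training uses every core you have, something like this is all that's needed (a sketch against the Python API; substitute whatever feature extractor path you already use):

```python
import multiprocessing
from mitie import ner_trainer

trainer = ner_trainer("MITIE-models/english/total_word_feature_extractor.dat")
# Match the trainer's thread count to the CPU cores actually available.
trainer.num_threads = multiprocessing.cpu_count()
```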