
Annotated sentences or whole text documents? #168

Open
vasilikivmo opened this issue Jan 12, 2018 · 3 comments

Comments


vasilikivmo commented Jan 12, 2018

I have a large number of texts which I want to run through the Python examples. My question is: does the annotated data I am going to provide have to be split up into sentences? Can I pass a whole text with multiple sentences, paragraphs and lines, so that the range numbers of each entity are counted over the whole document rather than over each sentence separately?

Based on the comment in the code (for example in train_ner.py):
"When you train a named_entity_extractor you need to get a dataset of sentences (or sentence or paragraph length chunks of text) where each sentence is annotated with the entities you want to find." I would like some clarification on the part about paragraph-length chunks of text.

Thank you in advance.
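
For reference, a minimal sketch of how a paragraph-length chunk (several sentences tokenized into one flat list) can be handed to the trainer, following the pattern in examples/python/train_ner.py. The tokens, entity spans, and feature-extractor path below are made up for illustration; the point being illustrated is that add_entity ranges are token indices counted within the chunk passed to ner_training_instance, not over any larger document:

```python
from mitie import *

# Tokens for one annotated chunk.  This can be a single sentence or a
# paragraph-length span; either way it is just one flat list of tokens.
tokens = ["My", "name", "is", "Davis", "King", "and", "I", "work",
          "at", "MIT", "in", "Boston", "."]

sample = ner_training_instance(tokens)
# Entity ranges are token offsets *within this chunk* (use xrange on
# Python 2), so counting restarts for every training instance you add.
sample.add_entity(range(3, 5), "person")         # "Davis King"
sample.add_entity(range(9, 10), "organization")  # "MIT"
sample.add_entity(range(11, 12), "location")     # "Boston"

# The feature-extractor path is assumed; point it at your own copy.
trainer = ner_trainer("MITIE-models/english/total_word_feature_extractor.dat")
trainer.add(sample)        # one add() call per annotated chunk
trainer.num_threads = 2
ner = trainer.train()
ner.save_to_disk("new_ner_model.dat")
```

If a whole document is passed as one instance, the ranges are then offsets into that document's token list; splitting into sentence or paragraph chunks only changes where the counting restarts.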

davisking (Contributor) commented Jan 12, 2018 via email

vasilikivmo (Author) commented

Thank you for your reply. Does this format affect the execution time?

I am struggling to run this, as it takes more than an hour on only a couple of annotated files, and as a result I am not able to add more (ideally I would like to train on around 1,500 annotated files).

I run this on an Ubuntu VM with a 2-core CPU (initializing 2 threads). If you need more information, please let me know.

Thank you!

davisking (Contributor) commented

It shouldn't matter much.

There is a certain fixed runtime that is a function of how difficult your data is. This fixed runtime can be large if your dataset is labeled in an inconsistent or difficult way. So adding more data shouldn't be a problem in terms of runtime. This is all assuming your machine isn't using a 10-year-old CPU, running out of RAM, and constantly swapping to disk or something like that. It's possible that your computer just sucks. But other than that you shouldn't worry about it.
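
On the thread count mentioned above: the trainer in the Python examples exposes a num_threads field, so one thing worth checking is that it matches the cores actually available on the VM. A small sketch (the feature-extractor path is again an assumption):

```python
import multiprocessing
from mitie import *

trainer = ner_trainer("MITIE-models/english/total_word_feature_extractor.dat")
# Use every core the VM exposes; on the 2-core machine described above
# this resolves to 2.
trainer.num_threads = multiprocessing.cpu_count()
```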
