
Annotated sentences or whole text documents? #168

Open
vasilikivmo opened this issue Jan 12, 2018 · 3 comments

Comments


vasilikivmo commented Jan 12, 2018

I have a large number of texts which I want to run through the Python examples. My question is: does the annotated data I am going to provide have to be split up into sentences? Can I pass a whole text with multiple sentences, paragraphs and lines, so that the range numbers of each entity are counted over the whole document rather than over each sentence separately?

Based on the comment in the code (for example in train_ner.py):
"When you train a named_entity_extractor you need to get a dataset of sentences (or sentence or paragraph length chunks of text) where each sentence is annotated with the entities you want to find." I would like some clarification on the part about paragraph-length chunks of text.

Thank you in advance.
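
For reference, a minimal sketch of how a paragraph-length chunk (several sentences tokenized into one flat list) can be handed to the trainer, following the pattern in examples/python/train_ner.py. The tokens, entity spans, and feature-extractor path below are made up for illustration; the point being illustrated is that add_entity ranges are token indices counted within the chunk passed to ner_training_instance, not over any larger document:

```python
from mitie import *

# Tokens for one annotated chunk.  This can be a single sentence or a
# paragraph-length span; either way it is just one flat list of tokens.
tokens = ["My", "name", "is", "Davis", "King", "and", "I", "work",
          "at", "MIT", "in", "Boston", "."]

sample = ner_training_instance(tokens)
# Entity ranges are token offsets *within this chunk* (use xrange on
# Python 2), so counting restarts for every training instance you add.
sample.add_entity(range(3, 5), "person")         # "Davis King"
sample.add_entity(range(9, 10), "organization")  # "MIT"
sample.add_entity(range(11, 12), "location")     # "Boston"

# The feature-extractor path is assumed; point it at your own copy.
trainer = ner_trainer("MITIE-models/english/total_word_feature_extractor.dat")
trainer.add(sample)        # one add() call per annotated chunk
trainer.num_threads = 2
ner = trainer.train()
ner.save_to_disk("new_ner_model.dat")
```

If a whole document is passed as one instance, the ranges are then offsets into that document's token list; splitting into sentence or paragraph chunks only changes where the counting restarts.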

davisking (Contributor) commented Jan 12, 2018 via email

vasilikivmo (Author) commented

Thank you for your reply. Does this format affect the execution time?

I am struggling to run this, as it takes more than an hour on only a couple of annotated files, and as a result I am not able to add more (ideally I would like to train on around 1,500 annotated files).

I run this on an Ubuntu VM with a 2-core CPU (initializing 2 threads). If you need more information, please let me know.

Thank you!

davisking (Contributor) commented

It shouldn't matter much.

There is a certain fixed runtime that is a function of how difficult your data is. This fixed runtime can be large if your dataset is labeled in an inconsistent or difficult way. So adding more data shouldn't be a problem in terms of runtime. This is all assuming your machine isn't using a 10-year-old CPU, running out of RAM, and constantly swapping to disk or something like that. It's possible that your computer just sucks. But other than that you shouldn't worry about it.
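
On the thread count mentioned above: the trainer in the Python examples exposes a num_threads field, so one thing worth checking is that it matches the cores actually available on the VM. A small sketch (the feature-extractor path is again an assumption):

```python
import multiprocessing
from mitie import *

trainer = ner_trainer("MITIE-models/english/total_word_feature_extractor.dat")
# Use every core the VM exposes; on the 2-core machine described above
# this resolves to 2.
trainer.num_threads = multiprocessing.cpu_count()
```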
