Disable parallelism for huggingface/tokenizers #233

Open
fgeeri opened this issue Nov 18, 2022 · 0 comments
fgeeri commented Nov 18, 2022

When using the transformers-based model de_dep_news_trf, I get a huggingface/tokenizers warning message in the console:

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

It appears that the underlying parallelism issue in huggingface/transformers might be fixed in the foreseeable future, as a commit that seems to address it landed just this week.

However, until that fix is released in transformers and picked up by spaCy, is there a way to set the mentioned environment variable when using spaCy through spacyr? The warning is printed in the console repeatedly until the R session is restarted, which is a nuisance. Setting spacy_tokenize(x, multithread = FALSE) does not influence the warning.
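One workaround I could imagine, assuming environment variables set with Sys.setenv() in R are visible to the embedded Python process that spacyr starts, would be to set the variable before initializing spaCy, roughly like this (untested sketch):

# Untested sketch: set the variable before spacy_initialize() so the
# embedded Python process sees it when tokenizers is loaded.
Sys.setenv(TOKENIZERS_PARALLELISM = "false")

library(spacyr)
spacy_initialize(model = "de_dep_news_trf")

If that works, the same Sys.setenv() call could also go into a project .Rprofile so it is set before anything else loads, but I have not verified that this actually suppresses the warning.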

Details and instructions on how to reproduce the warning

The warning message appears when using spacy_tokenize(x, what = "sentence"), but does not show up when using what = "words". The message is printed as black text like console output, not as blue text like normal R warnings.

The message is printed repeatedly, though not very frequently, maybe once a minute. It keeps appearing even after I've called spacy_finalize(); only restarting the R session stops it. Setting the multithread argument in spacy_tokenize() does not influence whether the warning appears.

I can consistently reproduce the warning by executing the following code and then saving the code file in RStudio (the warning only appears on saving).

library(spacyr)

text_taxi <- "Franz jagt im komplett verwahrlosten Taxi quer durch Bayern. Franz jagt im komplett verwahrlosten Taxi."

# Initialize spaCy with the German transformer model
spacy_initialize(model = "de_dep_news_trf")

# Tokenize into sentences; multithread = FALSE does not suppress the warning
spacy_tokenize(text_taxi,
               what = "sentence",
               multithread = FALSE,
               output = "data.frame")[, 2]

spacy_finalize()

# Now save the code file in RStudio; the warning appears on saving